Correction: Semantic and ontology-based analysis of regulatory documents for construction industry digitalization

Publication

Authors: Zarina Kabzhan, A. L. SHAKHNOVICH, Sergey Gorshkov, Yussuf Yemenov, Fedor Gorshkov, Nazym Shogelova

Year: 2025

Journal: Frontiers in Built Environment

DOI: 10.3389/fbuil.2025.1624950

Abstract

3 Methods and Materials

For the purpose of formalization and automated analysis of regulatory requirements applied in the construction industry, this study proposes a methodology based on the use of ontological modeling, natural language processing (NLP) techniques, and semantic analysis (Chen et al., 2024). The domain of interest is the set of building codes and regulations adopted in the Republic of Kazakhstan.

The study is grounded in the following key principles. First, regulatory documents possess a complex, multi-level structure that includes nested conditions, exceptions, cross-references, and hierarchically organized requirements. Accordingly, the proposed methodology incorporates a step-by-step analysis of the syntactic and semantic characteristics of regulatory statements, aimed at their subsequent formalization and alignment.

At the first stage, preliminary linguistic processing of the texts is performed, including tokenization, lemmatization, part-of-speech (POS) tagging, dependency parsing, and coreference resolution. These procedures are carried out using the DataVera EKG Language Processing (EKG LP) software module (DataVera, 2025), which is built on the SpaCy library and adapted to the specifics of regulatory vocabulary.

At the second stage, textual fragments are aligned with the ontological model, which is represented as a set of interconnected ontologies (Fig. 1):
Upper-level ontology (based on BFO), used to represent universal categories such as objects, processes, and relationships;
Domain ontology of the construction sector (based on IFC), covering capital construction assets, engineering systems, and life cycle processes;
Regulatory statement ontology, based on deontic logic, describing the structure of norms (subject, modality, action, object, and applicability condition);
Terminology ontology (SKOS model), providing linkage between the terms used in regulatory documents and the concepts of the domain ontology.

Fig. 1. Relationship between elements of the proposed ontologies

The formalized representation of regulatory provisions is carried out in the form of semantic profiles, which include the following elements: subject (addressee of the requirement), modality (obligation, possibility, prohibition), predicate (action or characteristic), object (result of the action), as well as additional attributes (conditions, exceptions, time frames, etc.).

To account for the complex structure of regulatory texts, the methodology implements mechanisms for:
Detection of nested conditions (through the analysis of syntactic structures and conditional operators);
Processing of exceptions, formed through negation constructs or limitations on the scope of regulations;
Reconstruction of hierarchical relationships between regulatory provisions, using structural markers and contextual analysis of headings, articles, and subsections.

At the final stage, a comparative semantic analysis is performed, aimed at identifying:
Duplicated provisions (when key elements of the semantic profile match);
Contradictions (when there are discrepancies in modalities or conditions of application);
Semantic inconsistencies (in definitions of terms and interpretations of concepts).

The comparison of semantic profiles is carried out based on a calculated similarity metric, the threshold value of which is determined empirically. In the case of significant discrepancies, the corresponding fragments are forwarded for expert review.

Fig. 2. Architecture of the automated system for processing regulatory document texts

The developed system is designed for the automated semantic analysis of regulatory documents, identifying contradictions, duplicated provisions, and semantic inconsistencies. The architectural solution (Fig. 2) is based on the use of ontological models, graph and relational databases, as well as natural language processing (NLP) methods.

The system includes several key components that ensure its functionality.
A graph-based RDF triple store (Apache Jena Fuseki) is used for storing ontological models, enabling complex semantic queries and analysis of relationships between concepts. A relational storage system (PostgreSQL) is employed to store the results of the linguistic analysis of regulatory texts (Jadala et al., 2024). An important element is the data management platform (DataVera EKG Provider (DataVera, 2025)), which ensures information storage in accordance with the ontological model, supports both synchronous and asynchronous APIs, executes SPARQL queries, and performs data validation using SHACL rules (Ke et al., 2024). The system also includes application software modules, such as the linguistic analysis module for regulatory documents (DataVera EKG LP (DataVera, 2025)) and the semantic analysis module, which identifies contradictions in terminology and detects duplicated provisions. Monitoring and logging tools, such as ELK and Zabbix, are used to ensure system oversight and log collection (Bilobrovets et al., 2023).

The system is implemented as a set of containers deployed in a Kubernetes environment (Poniszewska-Marańda et al., 2021), which ensures its scalability and fault tolerance.

The processing of regulatory texts is performed in stages, starting with grammatical and semantic analysis (DataVera, 2025):
Sentence structure analysis includes POS tagging and dependency parsing, which identify parts of speech and establish grammatical dependencies between words. Coreference resolution is also performed: pronouns are replaced with the nouns they refer to, and implied elements of the statement are made explicit.
Lemmatization converts word forms to their base form, simplifying subsequent processing and matching.
Semantic matching involves identifying the concepts corresponding to the words in the sentence based on ontological models.
In the absence of an exact match in the existing ontology, the system automatically generates ad hoc concepts limited to the specific context of the document.
Formation of the semantic profile involves identifying subjects, predicates, modalities, objects, circumstances, and other elements necessary for the structured representation of regulatory content.

The result of the algorithm's operation is the formalized representation of each statement in the form of a set of semantic profiles, suitable for further analysis. Based on the obtained semantic profiles, a comparison of regulatory provisions is performed, allowing for the identification of contradictions, duplication, and semantic inconsistencies.

The identification of contradictions in terminology is carried out by analyzing statements that contain definitions of regulatory terms. The comparison of such statements allows for classifying the results into three groups (Liu et al., 2020):
Semantic equivalence (the definitions are identical or close in meaning);
Difference in scope (one definition is a specific case of the other);
Semantic contradiction (mutually exclusive interpretations of the same term are identified).

The search for duplicated regulatory provisions is performed by comparing the key elements of the semantic profile. If statements from different documents have matching predicates, objects, subjects, modalities, and additional parameters, the system calculates a numerical similarity metric. If the threshold value is exceeded, the statements are considered duplicated.

Similarly, contradictory statements are identified. If two statements refer to the same entity (matching subject, predicate, and object) but have different modalities, a logical contradiction is detected. In cases where additional elements of the semantic description differ, the inconsistency is evaluated quantitatively.
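The duplicate and contradiction rules described above can be illustrated with a minimal sketch. It is not the EKG LP implementation: the class and function names, the equal weighting of profile elements, the Jaccard overlap for additional attributes, exact string matching of terms, and the threshold of 0.9 are all assumptions made for illustration, since the text states only that the threshold is determined empirically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticProfile:
    """Hypothetical container for the profile elements named in the text."""
    subject: str
    modality: str  # e.g. "obligation", "possibility", "prohibition"
    predicate: str
    obj: str
    attributes: frozenset = frozenset()  # conditions, exceptions, time frames


def similarity(a: SemanticProfile, b: SemanticProfile) -> float:
    """Assumed metric: mean of four exact-match flags and attribute overlap."""
    core = [
        a.subject == b.subject,
        a.modality == b.modality,
        a.predicate == b.predicate,
        a.obj == b.obj,
    ]
    union = a.attributes | b.attributes
    # Jaccard overlap of additional attributes; two empty sets count as equal.
    attr_score = len(a.attributes & b.attributes) / len(union) if union else 1.0
    return (sum(core) + attr_score) / (len(core) + 1)


def compare(a: SemanticProfile, b: SemanticProfile, threshold: float = 0.9) -> str:
    """Classify a pair of profiles; the threshold value is an assumption."""
    same_entity = (a.subject, a.predicate, a.obj) == (b.subject, b.predicate, b.obj)
    # Same subject, predicate, and object with different modality: contradiction.
    if same_entity and a.modality != b.modality:
        return "contradiction"
    # Similarity above the threshold: the statements are treated as duplicates.
    if similarity(a, b) > threshold:
        return "duplicate"
    # Same entity and modality, but the additional elements diverge.
    if same_entity:
        return "expert_review"
    return "unrelated"
```

In this sketch the modality check runs before the similarity check, so two statements that differ only in modality are reported as a contradiction rather than as near-duplicates. Replacing the exact string comparisons with ontology-based concept matching would be the natural extension.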
If the discrepancy exceeds the established threshold, the divergences are forwarded for expert analysis.

The developed method for analyzing regulatory documents has a number of limitations related to the depth of semantic processing. First, the system evaluates the semantic profile of each statement in isolation, which excludes the possibility of analyzing situations where a single statement in one document corresponds to multiple statements in another. Second, the current implementation does not account for the temporal aspect of regulatory provisions, meaning it does not analyze to which time period a particular directive applies (past, present, or likely future). Third, the system does not generate a comprehensive semantic description of the situations to which the requirements apply, but is limited to representing the regulatory directive in a structured form. While this simplifies the development and implementation of the system, such a level of formalization is insufficient for automated compliance checking and is intended solely for identifying inconsistencies and duplications in regulatory provisions.

To address the identified limitations, it is proposed to further develop the methodology across several interrelated directions. One of the key directions is the development of a mechanism for inter-document semantic aggregation, which would enable the establishment of relationships such as equivalence, specification, logical entailment, and subordination between regulatory statements, both within a single document and across multiple sources. This would allow for the modeling of complex regulatory dependencies and improve the accuracy of contradiction detection.

Particular attention will be given to incorporating the temporal aspect of regulatory requirements.
This involves annotating regulatory provisions with temporal markers (such as effective date, duration, and period of applicability), followed by integration with temporal ontologies. Such an approach will enable the tracking of regulatory evolution and the assessment of the applicability of provisions at a given point in time.

Another important direction is the modeling of regulatory situations through the expansion of the ontological model by incorporating concepts that describe typical scenarios for the application of requirements. This creates a foundation for shifting from the analysis of isolated provisions to a comprehensive assessment of regulatory conditions based on the context of design or operation of built assets. Such a level of detail will enhance the practical relevance of the developed system in professional practice.

To improve the completeness and validity of the analysis, it is proposed to integrate logic-based semantic reasoning using rule and constraint languages such as SHACL or SWRL. This will enable not only the interpretation of individual statements, but also the formalization of logical relationships between them, thereby allowing for deductive consistency checking of regulatory requirements.

Finally, an important element of future work is the implementation of a contextual semantic disambiguation mechanism using trainable language models (e.g., BERT or GPT) adapted to a corpus of regulatory texts. The use of such models is intended to support more accurate interpretation of terms and constructions depending on their usage context, especially in cases where the same concept may have different meanings in different sections or documents.

The implementation of the proposed directions is expected to address the current limitations and significantly expand the functional capabilities of the system.
This will pave the way for the development of a full-featured intelligent platform for regulatory analysis, capable of supporting tasks related to design, expert review, auditing, and legal compliance in the context of the construction industry's digital transformation.

The proposed architecture and methodology enable effective analysis of regulatory documents in the construction sector by providing their structured representation, identifying semantic inconsistencies, and supporting the development of a more coherent regulatory framework.

4 Results

To assess the applicability of the proposed approach, the study employed the EKG LP software suite, developed to address a wide range of text processing tasks. The choice of this software is justified by its ability not only to extract key entities and relationships from text, but also to generate an ontological representation of document structure, which is critically important for analyzing complex regulatory acts. Unlike many other systems, EKG LP provides built-in tools for constructing knowledge graphs and performing semantic annotation, enabling the automation of regulatory requirement interpretation, contradiction and the formalization of logical relationships between the software is with and document management systems, it in the context of digital in the construction EKG LP has not construction its is within aimed at the of the regulatory and including for information modeling and digital codes and The study the applicability of this in the context of construction its relevance and within this first of text processing in EKG LP involves and grammatical structure of performed using tools from the SpaCy et al., 2024). analysis, each element of the text is and syntactic and the identified grammatical dependencies are structured These dependencies are using the and are in The of this processing from the EKG LP is in an the of elements and structures be designed to and from foundation and other Fig.
of and grammatical structure of the sentence Fig. of and grammatical structure of the result of this processing is a structured representation of the in which each word and is to its form with its grammatical the of the representation, the words of the sentence are to the identified syntactic the of the each word is with its of speech and the of syntactic it with other sentence elements enabling further processing at the level of semantic on the data EKG LP constructs a of the statement the structure of which is to the model used in the et al., this the semantic structure of the text is including the key components of the predicate, object, subject, and the analysis of a specific where in the are designed to the following semantic components are as the subject, as the predicate to the base form and as the dependency are for both the subject and the object, enabling a more description of regulatory provisions and to the identification of their semantic The semantic structures are used to contradictions, duplicated provisions, and semantic inconsistencies in regulatory the proposed approach, the EKG LP software developed for the automated analysis of regulatory purpose in this study is to duplicated analyze the semantic similarity of and contradictions in regulatory its EKG LP generates a for each of key subject, predicate, object, modality, and The analysis that this structure is for representing regulatory documents in the construction sector are by a of syntactic a the further to enable more accurate modeling of complex the current of the in comparing with semantic analyzing two statements et al., EKG LP generates their semantic profiles, which out to be with only in The software calculates a semantic similarity from to where meanings and In the the value a of similarity between the a threshold for this metric, it to regulatory requirements that are duplicated within a single document or across different regulatory sources. 
This the applicability of the proposed methodology for the automated of regulatory information et al., comparing semantic profiles, statements are considered only the same the comparison result is set to may in cases where the in modality (e.g., or when one of the statements includes predicate negation (e.g., not of the limitations of the is that it does not account for the semantic similarity of individual a that are in meaning but in may a semantic similarity of address this two be models (e.g., which enable the assessment of term similarity based on their contextual for texts, such models limited as terms to be close to each the from to concepts using a ontological model, which it to account for hierarchical relationships between such as equivalence, and terms. This approach more accurate of semantic similarity and allows for the identification of logical contradictions at a the of using a model, two the of information assets, it is necessary to the of their data the of their be the semantic equivalence of statements, their which the to a semantic similarity of this a model incorporating the following ontology deployed in the system (Fig. which is by EKG the processing of semantic profiles, the with their corresponding concepts allowing for a more accurate of a result of the semantic similarity for the considered their In this approach, terms with or meanings are as which the of the Fig. 
in the results of a SPARQL to assess the scalability of the proposed approach, a preliminary of the of regulatory documents in the construction sector The analysis performed using to extract text from regulatory documents by the text used for the regulatory documents with a of which is to an regulatory the of the (the documents of construction information modeling, this be considered a suitable for of the proposed the of the on this it to key characteristics of the and its applicability to regulatory to ensure a of and of the method in structure, and further expansion of the corpus is the of the and document including a range of regulations (such as and as well as covering documents with structural will enhance the validity of the obtained An corpus will it to more the of semantic the across a of and in the ontological the of not only in the of the but also in the ability of the developed method to to of regulatory aspect that is critically important for its future practical application in the context of a regulatory on performed using the following characteristics text of in the by of the document significant of regulatory documents a limited set of key which the of constructing a model an analysis of the words in the documents The account for of the text a of in regulatory This supports the of effective of including the establishment of semantic and analysis that the use of semantic profiles in with a ontological model effective identification of duplicated regulatory requirements and assessment of their semantic The developed method also supports the of logical contradictions at the level of terms and their of an ontological model in of comparison a different level of text analysis. While provides only matching of word the ontological model allows for the of hierarchical and relationships between their contextual and their with specific concepts. 
This approach a and more grounded of regulatory texts, which is critically important for the automated interpretation of requirements and the identification of logical relationships between document results obtained in the of the study that the construction of a model of terminology is a but that significantly enhance the accuracy and completeness of automated analysis of regulatory documents in the construction of and the proposed using practical a of to duplicated and regulatory provisions in construction following documents used as of in of in documents contain a significant number of identical or definitions and provisions, for analyzing the capabilities of the developed semantic analysis, the texts preliminary processing aimed at elements that the automated of regulatory this stage, textual data from the using the et al., allowing to be obtained in a structured elements of the and other components not the semantic the data in the EKG LP in of semantic profiles of The final processing comparing the obtained semantic structures of the two documents at the the EKG LP identifies matching definitions in both the analysis, it a duplicated A or that be or to be by or of a to the specifics of the SpaCy in the grammatical structure analysis of the same may a the similarity for definitions does not improve the implementation of an additional is which would into account the and set of words in the This would allow for more of textual of in the to the also detects statements in both information management includes compliance with and requirements (the for building information modeling information the and of the information model and to information and developed method The following obtained the and grammatical structure of statements between two documents is performed in to the processing it to a method for document In a database of grammatical results for regulatory be enabling comparison of each document with to automatically duplicated and contradictory developed 
method for analyzing regulatory documents the of contradiction based on the comparison of semantic profiles of In inconsistencies may from in numerical in modality, or the of the following regulatory a on be from the information object or the of a in this the semantic profiles of numerical characteristics will be with the only of discrepancy the in numerical the sentence structure such as and their comparison does not and be implemented as a application of contradiction is a in the following the it is necessary to into account the of natural the it is to into account the of natural the similarity in the structure of the their semantic profiles to the use of different and analysis, this results in a semantic similarity value of a of semantic between the A result would be obtained in the case of modalities (e.g., contradiction method is to at three of in numerical within regulatory in the modality of of negation that the meaning of the be the level of necessary at to improve the accuracy and of the In expert the context of statements not by the cases of discrepancies related to the of the identified automated analysis with expert ensures more and of contradictions in regulatory such inconsistencies, the a semantic similarity threshold on the similarity of semantic profiles exceeds but one of the discrepancies is the statements are considered contradictory and are for further with assess the of the developed a comparative analysis its characteristics other used for automated analysis of regulatory The following matching method based on and similarity et al., modeling using et al., model, representing a of models for semantic similarity et al., proposed ontological which the formalization of statements in the form of semantic profiles and ontological relationships et al., comparison carried out using the following the of and contradictions identified by the the of and contradictions by the the of and expert of the of the and the of its results a from to comparative 
analysis that the proposed ontological method for semantic analysis of regulatory documents in identifying duplicated and contradictory provisions. to the method is to or existing including models based on significantly on the expert text processing the proposed methodology into account the specifics of regulatory the of logical the subject structure of and the of The use of a for representing regulatory provisions in the form of with ontological not only the identification of semantic discrepancies but also provides a foundation for the automated logical of the consistency of regulatory the developed approach as the foundation for building intelligent for regulatory analysis providing both and in comparison results the relevance and practical of the proposed methodology in the context of the of the construction and the of the regulatory for the integrate the developed approach into regulatory analysis processes, and regulatory management in the construction industry, the following of practical application be at the regulatory The method be implemented as a in the regulatory document automatically duplicated and contradictory provisions between existing and proposed This is when of and for the digital of the regulatory The proposed approach be within the of the regulatory and including the construction of organized of building codes and their automated This creates the foundation for from textual representation of requirements to their into information the developed method into software supporting building information modeling will allow for of design current regulations at the of design, to regulatory This is especially when compliance for model and of ensure effective it is to develop for in and design, the semantic profile the of ontology and the interpretation of analysis for the development of The method be applied the regulatory to documents with existing assess the consistency of provisions, and ensure the of in cases where documents of different industry, 
are proposed methodology has a of and be into the of in the construction regulatory and expert to design and this approach will enhance the consistency of the regulatory the of regulatory contradictions, and a more digital of the construction work proposes and a method for the automated analysis of regulatory documents in the construction industry, based on a of natural language processing (NLP) and ontological The developed ensures the of semantic profiles for regulatory provisions, identification of duplicated and contradictory statements, and of the relevance of documents.The use of an approach allows for the formalization of knowledge in regulatory documents and its integration into digital for regulatory requirements. The developed methodology its through the analysis of the building codes of the Republic of its ability to logical inconsistencies, the of regulatory provisions, and requirements across different have the of the developed it suitable for use in regulatory document analysis In the processing of a single document does not and the comparison of regulatory provisions is in This the implementation of a concept for comparative analysis of regulatory documents, identifying inconsistencies across of legal the the further directions for future the to the of regulatory statements of document structure articles, to improve the processing of complex regulatory semantic disambiguation using language models to enhance text analysis the system into the to ensure automated compliance of design with regulatory results of the study that the use of ontological modeling with is a direction for the automated analysis of regulatory The developed method as the foundation for intelligent systems, supporting the of the construction and the of regulatory