Knowledge extraction from simplified natural language text

Abdelaal, Hazem
Knowledge base creation and population provide an essential formal backbone for a variety of intelligent applications, such as decision support systems, expert systems, and intelligent search. Although knowledge extraction from unstructured text offers a means of easing the knowledge acquisition process, the ambiguous nature of language tends to reduce accuracy in more complex semantic analysis. Controlled Natural Languages (CNLs) are subsets of natural language that are grammatically restricted in order to reduce or eliminate ambiguity, either for the purposes of machine understanding or for unambiguous human communication within a domain or industry context, as with Simplified English. Moreover, CNLs help to engage non-expert users with no background in knowledge engineering, as these languages offer user-friendly interfaces that are easier to understand and more readily accepted by users. The latter, human-oriented type of CNL is under-researched despite having found favor in industry over many years. Rewriting such human-oriented CNL content into a machine-oriented CNL could potentially unlock significant silos of valuable, implicit, general-purpose domain knowledge. In this thesis, we have developed an approach built on a series of corpus-based rewriting rules for subsequent knowledge capture. Our work confirms that a substantial amount of human-oriented CNL content can be easily translated into a machine-processable CNL for formal knowledge capture with little semantic loss. In addition, we describe a novel dataset that aligns a representative sample of Simplified English Wikipedia sentences with a well-known machine-oriented CNL. This linguistic resource is both human-readable and machine-interpretable, and it can be used by the community as a gold-standard dataset to benefit a variety of language processing and knowledge-based applications.
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland