Schema-agnostic queries for large-schema databases: A distributional semantics approach

Freitas, André
The evolution of data environments towards growth in the size, complexity, dynamicity and decentralisation (SCoDD) of schemas drastically impacts contemporary data management. The SCoDD trend emerges as a central data management concern in Big Data scenarios, where users and applications demand more complete data, produced by independent data sources under different semantic assumptions and contexts of use. Most Database Management Systems (DBMSs) today target a closed communication scenario, where the symbolic schema of the database is known a priori by the database user, who is able to interpret it unambiguously. The context in which the data is consumed and produced is well defined, and it is typically the same context in which the data was created. In contrast, data management under SCoDD conditions targets an open communication scenario, where the symbolic system of the database is unknown to the user and multiple interpretation contexts are possible. In this case, the database can be created under a different context from that of the database user. The emergence of this new data environment demands revisiting the semantic assumptions behind databases and designing data access mechanisms which can support semantically heterogeneous (open communication) data environments. This work aims to fill this gap by proposing a complementary semantic model for databases, based on distributional semantic models. Distributional semantics provides a complementary perspective to the formal perspective of database semantics, supporting semantic approximation as a first-class database operation.
Unlike models which describe uncertain and incomplete data, or probabilistic databases, distributional-relational models focus on the construction of conceptual approximation approaches for databases, supported by a comprehensive semantic model automatically built from large-scale unstructured data external to the database, which serves as a semantic/commonsense knowledge base. The semantic model can be used to support schema-agnostic queries, i.e. queries that abstract the data consumer from the specific conceptualization behind the data. The proposed distributional-relational semantic model is supported by a distributional structured vector space model, named τ-Space, which represents structured data under a distributional semantic model and which, in coordination with a query planning approach, supports a schema-agnostic query mechanism for large-schema databases. The query mechanism is materialized in the Treo query engine and is evaluated using schema-agnostic natural language queries. The evaluation confirms that distributional semantics provides a high-recall, medium-high precision, and low-maintenance solution to cope with the abstraction and conceptual-level differences in schema-agnostic queries over large-schema/schema-less open domain datasets. Moreover, the compositional semantic model defined by the query planning mechanism supports expressive schema-agnostic queries over such datasets. The proposed distributional-relational structured vector space model (τ-Space), materialized as an inverted index, supports the development of a schema-agnostic query mechanism with interactive query execution time.
Attribution-NonCommercial-NoDerivs 3.0 Ireland