Knowledge graph mining with latent shape graphs

Muñoz, Emir
Knowledge graphs are graph-structured knowledge bases that have shown to be of great value in many Artificial Intelligence applications in academia and industry alike. They are typically generated automatically from un-/semi- structured data sources. The increasing popularity of knowledge graphs has been limited by multiple challenges given the size and quality of the information they contain. This thesis explores the relationship between the quality of knowledge graphs and machine learning technologies used to discover and extract knowledge from them. We focus on quality in terms of completeness and consistency. Knowledge graphs provide the flexibility required for representing knowledge at different scales in open environments such as the Web. However, their versatility makes them have an ever-changing schema, which also makes them hard to summarize and understand their content. Moreover, they are typically never complete—even in very specific domains—and their consistency with respect to a given schema or ontology cannot be guaranteed without the corresponding validation. That lack of an accurate schema has shown to be problematic in use cases where applications might need to rely on the fact that data satisfy a set of constraints. The contribution of this thesis is twofold. Firstly, we propose a scalable data-driven method to exhibit the actual (latent) shape of graph data. We introduce an algorithm for mining relation cardinality bounds and building so-called shapes that exhibit important aspects of the structure (or topological information) of entities and relations in a knowledge graph. Latent shapes also allow us to formalise an approximate algorithm for validating the structure of knowledge graphs. Secondly, we exploit the latent shapes of entities and relations to enhance the performance of machine learning models aimed to predict missing links and complete knowledge graphs. We use local patterns information and graph-based feature models in the Bioinformatics domain for improving the prediction of adverse drug reactions achieving new state-of-the-art results. Finally, we extend latent feature models by encoding the cardinality of relations as a regularisation term used to learn semantic embeddings that improve the precision of downstream prediction tasks in benchmark datasets.
NUI Galway
Publisher DOI
Attribution-NonCommercial-NoDerivs 3.0 Ireland