Cataloguing and linking publicly available biomedical SPARQL endpoints for federation - addressing aPosteriori data integration

Hasnain, Syed Muhammad Ali
In recent years, the increasing adoption of Open Data initiatives and Linked Data principles has led to the creation of a globally distributed space of Linked Data that covers domains such as Government, Libraries, Life Sciences, Media, Geographic and Social Web. Approaches that treat this data space as one huge distributed database and enable the execution of declarative queries over it hold enormous potential: they allow users to benefit from a virtually unbounded set of up-to-date data. As a consequence, several research groups have started to study such approaches. The Life Sciences domain has been one of the early adopters of Linked Data, and at present a considerable portion of the Linked Open Data cloud consists of Life Sciences datasets, known as Life Sciences Linked Open Data (LS-LOD). Although the publication of datasets as RDF is a necessary step towards unified querying of biological datasets, it is not sufficient to achieve the interoperability needed for a queryable Web of Life Sciences data. This interoperability can be achieved either by “a priori integration”, which ensures that multiple datasets use the same vocabularies and ontologies, or by “a posteriori integration”, which uses mapping rules that change the topology of graphs so that integrated queries become possible. “A posteriori integration” of Biomedical and Life Sciences data sources is the topic of this thesis. This dissertation first provides an analysis of freely and openly available data sources (public SPARQL endpoints). The endpoints were analysed with two questions in mind: (i) what is the content of a public SPARQL endpoint, and (ii) how self-descriptive are these endpoints? To analyse them, we defined a set of self-descriptive SPARQL queries.
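As a rough illustration of what such content-probing queries might look like, the following Python sketch builds SPARQL Protocol GET requests for a few generic probes. The query texts, the names `PROBE_QUERIES` and `build_probe_url`, and the example endpoint URL are assumptions made for illustration; they are not the query set defined in the thesis.

```python
from urllib.parse import urlencode

# Hypothetical probe queries (assumptions, not the thesis's actual query set).
# Each asks an endpoint to describe part of its own content.
PROBE_QUERIES = {
    # Which classes have instances in this endpoint?
    "classes": "SELECT DISTINCT ?class WHERE { ?s a ?class } LIMIT 100",
    # Which properties are used in this endpoint?
    "properties": "SELECT DISTINCT ?property WHERE { ?s ?property ?o } LIMIT 100",
    # Does the endpoint publish VoID dataset descriptions about itself?
    "void_descriptions": (
        "PREFIX void: <http://rdfs.org/ns/void#> "
        "SELECT DISTINCT ?dataset WHERE { ?dataset a void:Dataset } LIMIT 100"
    ),
}


def build_probe_url(endpoint_url: str, probe_name: str) -> str:
    """Return a SPARQL Protocol GET URL that runs one probe on one endpoint."""
    query = PROBE_QUERIES[probe_name]
    # The SPARQL Protocol passes the query text in the 'query' parameter.
    separator = "&" if "?" in endpoint_url else "?"
    return endpoint_url + separator + urlencode({"query": query})


if __name__ == "__main__":
    # http://example.org/sparql is a placeholder, not a real endpoint.
    print(build_probe_url("http://example.org/sparql", "classes"))
```

The sketch only constructs request URLs; actually sending them, and handling slow or unavailable endpoints, is left out.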
Following this analysis, we introduce the notion of Autonomous Resource Discovery and Indexing (ARDI) for facilitating “a posteriori integration” of Biomedical and Life Sciences data sources. In particular, we introduce a Cataloguing and Linking mechanism that enables us to formally query Biomedical and Life Sciences Linked Open Data on the World Wide Web (WWW). As of 31st March 2016, ARDI consists of 263,731 triples representing 12,658 distinct classes, 1,792 distinct properties and 13,027 distinct Orphan classes catalogued from 137 public SPARQL endpoints. Building on these Cataloguing and Linking approaches, we propose BioFed, a federated query processing engine for Life Sciences Linked Open Data. BioFed offers a single point of access to distributed Life Sciences data, enabling scientists to access data from reliable sources without extensive expertise in SPARQL query formulation. BioFed federates SPARQL queries over more than 137 public SPARQL endpoints. After demonstrating ARDI and its practical applications, this dissertation presents the Linked Biomedical Dataspace (LBDS), which enables the semantically enriched representation, exposure, interconnection, querying and browsing of Biomedical data and knowledge in a standardised and homogenised way. We provide three practical scenarios, known as workflows, for using the proposed LBDS, and we list Lessons Learned and Recommendations for developing the different components of LBDS, as we believe these insights will be useful for Linked Data practitioners and researchers working on topics similar to those covered in this thesis.
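To make the idea of federating one query over several endpoints concrete, the sketch below assembles a SPARQL 1.1 federated query in which the same graph pattern is sent to each endpoint through a `SERVICE` clause and the results are combined with `UNION`. This is a generic illustration of SPARQL 1.1 federation and is not claimed to be how BioFed works internally; the function name `build_federated_query` and the example endpoint URLs are hypothetical.

```python
from typing import Sequence


def build_federated_query(
    endpoints: Sequence[str],
    graph_pattern: str,
    variables: Sequence[str],
) -> str:
    """Build one SPARQL 1.1 query that asks every endpoint the same pattern.

    Hypothetical helper for illustration only.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    # SERVICE SILENT lets the query continue if one endpoint fails.
    blocks = [
        f"  {{ SERVICE SILENT <{url}> {{ {graph_pattern} }} }}"
        for url in endpoints
    ]
    body = "\n  UNION\n".join(blocks)
    return f"SELECT DISTINCT {' '.join(variables)} WHERE {{\n{body}\n}}"


if __name__ == "__main__":
    # Placeholder endpoints, not real services.
    print(
        build_federated_query(
            ["http://a.example/sparql", "http://b.example/sparql"],
            "?drug a ?type",
            ["?drug", "?type"],
        )
    )
```

A catalogue of the classes and properties each endpoint exposes could be used to restrict the `endpoints` list to relevant sources before the query is built, rather than contacting every endpoint.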
Attribution-NonCommercial-NoDerivs 3.0 Ireland