Publication

Semantic modelling of protein-protein interactions, their prediction and evaluation

Kazemzadeh, Laleh
Abstract
The amount of biomedical data produced by DNA sequencing, by curated knowledge on disease mechanisms and treatments, by biochemical and pharmaceutical research and by many other data-generating studies is growing at an unconstrained pace. This wealth of biomedical data is a precious resource for integrative research studies, which draw conclusions by analysing heterogeneous data for knowledge discovery. One such research domain is the prediction of protein-protein interactions from diverse data sources which may not directly expose protein interaction data but may hold hidden information that helps identify novel interaction candidates. However, data integration and data interpretation have to overcome a number of hurdles that result from the characteristics of biomedical data sources, including diversity in protein naming and issues of data consistency, analogy, availability and interoperability. The aim of this thesis is to harness the capability of big biomedical data by integrating its artifacts through the application of Semantic Web and Linked Data principles, with the final goal of predicting protein-protein interactions from the data. A semantic model for protein-protein interaction networks has been developed in this work and is used to identify explicit knowledge on protein interactions. This model is based on protein traits which have been extracted from publicly available biomedical data sources. The research work in this thesis has led to the integration of descriptive features of proteins from public reference data sources denoting known interactions, and subsequently to the prediction of novel interactions. The prediction model included novel attributes such as, for each protein, the genomic location of its gene and its immediate neighbouring network.
Through the integration of these features, a Naive Bayes approach achieved a prediction accuracy close to 94%, measured against a gold standard of known protein-protein interactions. The semantic integration of the biomedical data covers protein data and their interaction networks. This approach used state-of-the-art integration techniques based on Linked Data principles and relying on a basic ontology. The data is exposed through a visual analytics platform (called LinkedPPI) optimised for intuitive data exploration. A selection of the predicted protein interactions was then validated in laboratory experiments. The positive outcomes of the experimental validation demonstrate that the prediction model and the data integration form an effective means of selecting the most relevant, as yet unknown, protein interaction candidates.
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland