Publication

Processing the scientific literature for protein relations (for myocardial infarction) and validation of hypothetical relations through experimental results

Halder, Arindam
Abstract
The last decade has been characterized by a deluge of published scientific literature and of data generated by biological experiments. The major impediment preventing the research community from unlocking the knowledge contained in the literature and further driving biomedical research is the inability to process huge data resources quickly and efficiently. Efforts are ongoing to develop knowledge management systems with automated text mining methods at their heart, but the results have been tempered by factors such as ease of use, efficiency, reliability, and applicability across different scenarios. Despite these shortcomings, the research community has been able to leverage automatic literature processing to close the information gap between the reporting of facts in the literature and the use of validated data in research, in particular in the area of drug discovery. Within the scope of this thesis, I have brought together state-of-the-art solutions from two areas of automated text mining, machine learning-based methods and lexicon-based methods, so as to leverage the advantages of both approaches while minimizing the problems faced when they are used in isolation. The work has been used to identify protein interactions in order to validate the biological mechanisms explaining the repair processes following myocardial infarction. Recently, the less-investigated protein SPARCL1 has emerged as a prognostic marker in various types of cancer and in wound healing and repair mechanisms in different tissues, including injured cardiomyocytes, but its mechanism of action remains unclear. The presented research work used established data resources for the mining of Medline abstracts and full-text literature.
This thesis used features from the GENIA Corpus for relation extraction and used geniatagger as the machine learning-based text mining component, in conjunction with a lexicon of over 8 million gene and protein name instances derived from HGNC and UniProt. The hybrid NER solution was evaluated against the BioCreative 2 corpus, achieving 68.7% precision and 64% recall, which is equivalent (or even superior) to other reported solutions that use heterogeneous data sources for text mining and include a gene normalisation step. The relation extraction was evaluated against the IntAct corpus and led to recall and precision of 3.7% and 28.6%, respectively, which again is equivalent to state-of-the-art solutions. The findings from the automatic text processing were compared against other available (interactive, non-automatic) data sources for relevant protein relations, identifying the interaction between Decorin and SPARCL1 as a very promising candidate for the regulation of inflammatory processes in relation to myocardial infarction, one that has not been considered a key candidate in the literature so far. The interaction between Decorin and SPARCL1 has been validated experimentally, and the results provide compelling evidence that it should be considered the key factor for mediating repair of cardiomyocytes after myocardial infarction, offering opportunities to develop new therapies. Overall, the thesis demonstrates the full workflow from hypothesis generation from the scientific literature through to validation by experimental results, and thus extends the work of Swanson and Smalheiser, in which only the association of concepts that are not directly related was shown by finding similarities between sets of documents. The experimental validation aims to provide a high degree of confidence in the deployed workflow and to open up avenues for new therapies in other conditions as well.
Publisher
NUI Galway
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland