Publication

Improving disease diagnosis and biomarker identification using data-driven machine learning and knowledge graphs

Verma, Ghanshyam
Abstract
Respiratory viral infections can lead to severe outcomes among individuals with other aggravating primary diseases, in particular when these impair the function of the respiratory system. Such severe cases may increase the likelihood of death in elderly or immunocompromised individuals. Moreover, each influenza epidemic increases healthcare costs through excess hospitalizations, in addition to the need for substantial amounts of vaccines. The spread of respiratory viral diseases affects all age groups and thus can lead to periodic pandemics such as COVID-19. Measurement of gene expression has the potential to uncover internal biological changes that occur during the infection transition phase; therefore, analysis of genes is important in dealing with infectious diseases. In this thesis, we present new algorithms and approaches for early-stage disease diagnosis, biomarker identification, and personalised disease diagnosis using gene expression data. The first key contribution of this thesis is a methodology for early-stage diagnosis of respiratory viral infection by analyzing gene expression profiles of subjects using data-driven machine learning. This contribution is important because it can help in detecting the state of infection even before subjects start showing symptoms. The second key contribution is a new feature ranking algorithm named Ranked MSD, which computes an importance score for every given feature and ranks the features by that score. Gene expression datasets often contain thousands of features (genes), and identifying a small subset of strongly relevant genes is crucial for biomarker identification and for disease prediction using a reduced feature set. To address this, we propose two feature selection algorithms, Find Fequal and Find Fbest.
Our Find Fequal algorithm can identify a small subset of strongly relevant features that achieves prediction accuracy statistically equal to that of the full feature set. For example, on Dataset 1, Find Fequal achieved accuracy statistically equal to that of the full feature set using only 65 genes instead of 12,023, and on Dataset 2, it achieved accuracy statistically equal to that of the full 20,737-gene feature set using just 31 genes. Furthermore, our Find Fbest algorithm can identify a subset of relevant genes that achieves the numerically highest disease prediction accuracy, even higher than that of the full feature set. Find Fequal and Find Fbest contribute to efficient biomarker identification and form the third key contribution of this thesis. The final key contribution is a new approach for enabling personalised disease diagnosis by combining patients’ temporal gene expression data with a Knowledge Graph (KG). We propose two new algorithms, LOADDx and SCADDx, that can produce a short personalised ranked list of the most likely diseases for each patient at a requested time-point, from a knowledge base with thousands of diseases. We show how a patient’s Least Differentially Expressed Genes (LDEGs), along with Most Differentially Expressed Genes (MDEGs), can help in disease diagnosis in the presence of a KG. To the best of our knowledge, LDEGs have not previously been used for disease diagnosis in combination with KGs. We also show how KGs that do not include link strength information can be used to infer the strength of links in a patient-specific manner, using the patient’s gene expression profile. Both algorithms are tested on four real-world gene expression datasets of respiratory viral infection caused by Influenza-like viruses of 19 subtypes.
We also compare the performance of the proposed algorithms with that of five existing state-of-the-art machine learning algorithms (k-NN, Random Forest, XGBoost, Linear SVM, and SVM with RBF Kernel) using two validation approaches: leave-one-out cross-validation (LOOCV) and a single internal validation set. Both SCADDx and LOADDx outperform the existing algorithms under both validation approaches. SCADDx is able to detect infections with up to 100% accuracy on Datasets 2 and 3. Overall, SCADDx and LOADDx are able to detect an infection within 72 hours of infection with 91.38% and 92.66% average accuracy, respectively, across all four datasets, whereas XGBoost, which performed best among the existing machine learning algorithms, detects the infection with only 86.43% accuracy on average. Moreover, the proposed algorithms can provide a short ranked list of the most likely diseases for each patient, along with their most affected genes and other entities linked with them in the KG, which can support health care professionals in their decision-making.
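As a rough illustration of the LOOCV validation approach named above, the following minimal sketch holds out each subject once and predicts its label from the remaining subjects. It is not the thesis code: the 1-nearest-neighbour classifier, the function names, and the toy two-gene values are all assumptions chosen to keep the example self-contained.

```python
from math import dist


def nearest_neighbour_predict(train_x, train_y, query):
    """Return the label of the training sample closest to `query` (1-NN)."""
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]


def loocv_accuracy(samples, labels):
    """Leave-one-out cross-validation: each sample is the test set exactly once."""
    correct = 0
    for i in range(len(samples)):
        # Train on every sample except the i-th, then test on the i-th.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbour_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical toy expression values (two genes per subject), not thesis data.
samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (0.9, 1.0), (1.0, 0.8), (0.95, 0.9)]
labels = ["healthy", "healthy", "healthy",
          "infected", "infected", "infected"]
accuracy = loocv_accuracy(samples, labels)
```

LOOCV suits small cohorts because every subject contributes to both training and testing, whereas a single internal validation set, the second approach mentioned, reserves a fixed subset of subjects for testing only.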
Publisher
University of Galway
Rights
CC BY-NC-ND