Publication

Novel statistical models and computational tools for gene set analysis

Abstract
The gene is often treated as an observational unit in biology. The signals in a transcriptomic assay are mapped to genes, while the results of genomic and functional assays are frequently linked to genes to increase biological interpretability. However, the function of a particular gene is not always known and can change depending on cellular context. Furthermore, the number of genes identified as worthy of interest can be far too great for researchers to interpret and extract biological insights from them one by one. Analysing groups of genes related by function, partaking in a common biological pathway or sharing biochemical similarities can address these pitfalls. The criteria and knowledge used to construct gene sets are incorporated into the downstream analysis of transcriptomic, genomic and functional assays, focussing the researcher's attention on a comparatively small number of well-defined pathways. In Chapter 2, we outlined the log-fold change (LFC) distribution as a conceptual framework that can be used to understand the different null hypotheses being tested by various gene set analysis (GSA) tools in the context of differential gene expression. This framework led to the development of a set of GSA tests based on modelling the LFC distribution as a mixture of Gaussian random variables. The different tests provide parallels to popular GSA methods with significant advantages in sensitivity and interpretability in both simulations and real data analysis. In Chapter 3, we developed additional GSA tests to interpret the results of cell-type-specific differential expression analyses. Inference of cell-type-specific differential expression from a heterogeneous sample is associated with high levels of uncertainty. This uncertainty necessitated the development of non-parametric GSA tests based on the LFC distribution, making fewer assumptions and leading to more robust results.
Both parametric and non-parametric tests were made available in an R package and an R Shiny application, which allows researchers to efficiently run GSA tests and visualise and interpret results across all cells in the experiment. In Chapter 4, the focus of the thesis shifts from enrichment in the context of differential gene expression to enrichment for variants in genomic regions. We performed genomic enrichment tests on de novo variants in autism spectrum disorder probands. Groups of genomic regions could be compared with the rest of the genome or between cases and controls. The former tests (internal, in that they compared different regions of the genome with each other) aimed to model the distribution of variants across the genome to detect regions with a significantly higher number of variants than expected. The latter (external) tests, through the use of an appropriate control cohort, avoided making any assumptions about the distribution of de novo variants along the genome and were better suited to testing large numbers of variants not previously associated with the trait.
Publisher
University of Galway
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International