Machine learning and high-performance computing: Infrastructure and algorithms for the genome-scale study of genetic and epigenetic regulatory mechanisms with applications in neuroscience

Ó Broin, Pilib
The advent of next-generation sequencing (NGS) has fundamentally changed modern genomics research. These sequencers generate terabytes of data and necessitate the use not only of high-performance computing (HPC) clusters for data processing and storage, but also of intelligent, scalable algorithms for pattern discovery and data mining. This thesis details the development of infrastructure and algorithms that automate much of this data analysis process, allowing bench biologists to remain focused on the scientific questions that drive them rather than on the informatics challenges associated with these new platforms. We describe WASP, one of the first end-to-end systems to handle all aspects of NGS data generation, including sample submission, laboratory information management system (LIMS) functionality, and assay-specific processing pipelines. Furthermore, we present two machine learning algorithms for the secondary analysis of ChIP-seq data: the first uses self-organising maps (SOMs) for improved de novo motif discovery, and the second uses genetic algorithms (GAs) to automatically cluster transcription factor binding motifs. Finally, we apply this infrastructure and these techniques to the study of the TBX1 transcription factor in 22q11.2 Deletion Syndrome, examining its putative role in neural development, adult neurogenesis, autism spectrum disorder (ASD), and schizophrenia.
Attribution-NonCommercial-NoDerivs 3.0 Ireland