Machine learning for De Novo peptide identification

McDonnell, Kevin
Proteomics involves the identification and analysis of proteins and thereby provides valuable insight into ecosystem functioning. In this methodology, protein sequences are typically identified using a bottom-up approach in which short subsequences called peptides are matched to experimental mass spectra using a database search. However, on average, 75% of the spectra recovered from experiments are reported to remain unidentified. De novo peptide identification is an alternative to database searching that uses only the spectrum to identify the peptide sequence. This method has improved significantly in recent years, in part due to the integration of machine learning models into the algorithms. This thesis explores the strengths and weaknesses of many of the current state-of-the-art de novo peptide identification algorithms through an extensive evaluation. As understanding the underlying data is key to this analysis, a comprehensive survey of the characteristics of tandem mass spectra is included alongside the performance of the algorithms. An alternative machine learning architecture is then proposed to address the weaknesses found. The proposed novel CNN-GNN peptide ion encoding module identified more peptide ions than the encoding modules used by state-of-the-art de novo peptide identification algorithms in all datasets tested. Finally, the utility of artificial data in the context of de novo peptide identification is explored. Artificial spectra were found to be missing critical noise that was present in real data. However, quantifying this noise and introducing it into artificial spectra increased their similarity to real spectra, significantly improving their potential for use in the training and testing of models. Based on the results of this thesis, we recommend specific research avenues for the design and development of the next generation of de novo peptide identification algorithms.
This thesis not only demonstrates the challenges facing de novo peptide identification, but also takes the critical first steps toward overcoming them.
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland