Computational approaches to identify and explain sources of error in cancer somatic mutation data

O’Sullivan, Brian
Errors in the identification of somatic mutations in cancer samples can have critical implications in both research and clinical applications. Failure to detect potential variants of interest can lead to missed opportunities in patient treatment or scientific research. Incorrectly identifying a somatic variant may result in inaccurate prognosis, unsuitable treatment selection, or misleading research. By understanding the sources of error in somatic mutation calling, we are better placed to mitigate these risks. The reevaluation of variants that have been excluded from analysis by mutation calling methodologies can provide valuable insights in this regard. By considering the allele frequency, nucleotide context, and potential impact on protein of a mutation that has been discarded from analysis, we can incorporate the overall biological context into our assessment of the variant call. This approach enables us to identify putative somatic variants that were overlooked by the caller and, importantly, investigate the reason for their omission. In Chapter 2, we outline vcfView, an interactive R Shiny tool designed to support the evaluation and exploratory analysis of somatic mutation records from cancer sequencing data. We use vcfView to reevaluate the TCGA acute myeloid leukaemia data and identify clinically actionable mutation records in patients that were incorrectly excluded from analysis due to the presence of tumour sample DNA in the matched normal sample. The validation of somatic mutation calling pipelines is a critical step in ensuring the accuracy and reliability of the results obtained from the analysis of cancer genomic sequencing data. However, the trustworthiness of the validation results is directly linked to the quality of the truth set used for validation. In Chapter 3, we introduce a simulation framework designed to generate comprehensive and realistic tumour genomic sequencing data.
This framework takes into account the inherent randomness of genomic sequencing, providing an accurate representation of the frequency profile as it is observed in real sequencing data. It generates a corresponding truth set alongside the simulated sequencing data, documenting the true source of each non-reference base in the data. Unlike existing validation methods, this truth set not only identifies variant caller errors but, crucially, enables us to understand the reasons behind the erroneous calls. Using the GATK Mutect2 variant calling pipeline, we apply this framework to highlight and explain sources of error in somatic mutation data and biases in the estimation of somatic allele frequency. Finally, in Chapter 4, we analyse tumour-only sequencing and somatic variant data from an unpublished dataset comprising 60 individuals diagnosed with early-onset and aggressive pancreatic ductal adenocarcinoma. We apply the tools and methods we have developed previously to recover somatic variant information from sequence data obtained from heavily damaged FFPE samples. We provide an improved estimate of the true incidence of pathogenic KRAS variants within the cohort that accounts for the sequencing strategy and sample preparation methods used. We also highlight recurrent mutations in several other cancer-associated genes that may have played a role in disease progression in these patients.
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland