Spectral data augmentation and classification: Methods, resources, and applications for Raman and near-infrared spectroscopy
Files
2026flanaganphd.pdf
Adobe PDF, 17.7 MB
- Embargoed until 2026-10-16
Publication Date
2026-04-17
Type
doctoral thesis
Abstract
Spectral analysis methods have become increasingly interdisciplinary and are now used across fields including pharmaceutical development, agriculture, and food safety. Recent literature has focused on generative artificial intelligence (AI) and complex deep learning models to address the persistent challenge of limited data, which arises because large spectral datasets are expensive to generate and serve niche applications. While deep learning and generative AI methods offer the potential to expand spectral datasets, an understanding of the fundamental characteristics of spectral data, as well as the intricacies of the state of the art, is essential to avoid over-engineered models that fail to generalise. As a result of this focus, many traditional methods that respect the inherent properties of spectral data are often overlooked, and current literature rarely provides guidance on their implementation and implications. Our contributions aim to address these challenges surrounding resource accessibility and computational analysis methods for two vibrational spectroscopy techniques, namely Raman and near-infrared (NIR) spectroscopy.
Specifically, this thesis provides five key contributions in computational spectroscopic analysis. First, we present a systematic review and tutorial on state-of-the-art spectral preprocessing, data augmentation, and generative AI methods. This provides essential guidance and clarity on the methods for spectroscopic applications while addressing gaps in foundational techniques, reproducible implementations, and the ethical considerations involved in managing data integrity. Second, we investigate the impact of synthetic data augmentation on deep neural networks (NNs) trained on limited Raman spectral datasets, establishing upper bounds on synthetic data requirements and evaluating the cost-benefit considerations in terms of computational resources and implementation effort. Third, we demonstrate that simpler modelling approaches can achieve competitive performance when deep learning is impractical, specifically showing how one-vs-rest (OVR) classification strategies outperform traditional multi-class approaches for NIR spectra. Fourth, we introduce a comprehensive open-source Raman spectral dataset comprising 3,510 spectra of thirty-two pure solvents and reagents commonly used in active pharmaceutical ingredient (API) development. We outline the protocols for acquiring, annotating, and releasing this dataset, and present our analysis of improving data quality alongside benchmark evaluations using machine learning methods. Finally, we demonstrate how transfer learning and data augmentation can significantly improve both model robustness and state-of-the-art performance when working with limited data.
These contributions provide practical advice for selecting appropriate computational methods in spectroscopic analysis. They also demonstrate the conditions under which synthetic data augmentation provides genuine benefits, and they establish alternative classification strategies that can outperform complex deep learning approaches. Additionally, they make valuable open-source resources available to address data accessibility challenges for pharmaceutical-based tasks, and they use these resources in practical settings to improve performance and generalisability.
Publisher
University of Galway
Rights
CC BY-NC-ND