
Child speech understanding and generation via neural ASR and TTS models

Jain, Rishabh
Abstract
Text-to-Speech (TTS) and Speech-to-Text (STT) technologies have improved significantly in recent years with the introduction of deep learning-based, data-driven approaches, yet applying these technologies to child speech presents unique challenges. Most current research and solutions focus on adult speech rather than child speech. The main reason for this disparity is the limited availability and poor quality of children's speech data suitable for training modern speech Artificial Intelligence (AI) systems. Child speech datasets often contain noisy recordings and lack diversity, making them small, low-quality, and unrepresentative resources for developing effective solutions. Child speech also differs markedly from adult speech in its linguistic and phonetic characteristics, as well as in pitch, articulation, and pronunciation. These differences pose substantial challenges for developing effective TTS and STT systems for children. Moreover, ethical considerations and GDPR compliance necessitate careful handling of child speech data, emphasizing the need for legally compliant data collection methods. The shift to deep neural network (DNN) and AI-based systems has improved the capacity to train on limited child speech data; nevertheless, data availability remains a challenge, especially when striving to represent the linguistic and phonetic patterns of children from diverse backgrounds.

Our research focuses on several key areas: enhancing TTS and STT technologies for child speech in low-resource scenarios, creating and augmenting child speech datasets, and integrating these technologies into practical applications such as smart toys capable of interacting with and understanding children. We explore state-of-the-art (SOTA) methodologies, including the development and optimization of Tacotron 2 and FastPitch models for child speech synthesis, and the application of wav2vec2, Whisper, and Conformer models for improved child speech recognition. We also apply advanced data augmentation methods to overcome the limitations posed by the scarcity of child speech data. Additionally, our work contributes to the broader field by developing a facial animation pipeline and creating synthetic speaking children, addressing both technological and ethical considerations in child speech processing.

The main goal of this research is not only to advance the state of child speech technologies but also to ensure their ethical and effective application in smart toys. This comprehensive study represents a significant step forward in speech technology, particularly in making TTS and STT systems more accessible, representative, and effective for child users. By addressing the unique challenges associated with child speech and leveraging the latest advances in AI and deep learning, we contribute to the development of more interactive, engaging, and supportive technological solutions in this area.
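
To make the data augmentation idea concrete, the sketch below pitch-shifts an adult recording to approximate the higher pitch of child speech, one common low-resource augmentation. This is a minimal illustration, not the pipeline used in the thesis; the file names, the 16 kHz sample rate, and the +4 semitone shift are assumptions chosen for the example.

    # Illustrative sketch only (not the thesis pipeline): simulate child-like
    # speech from an adult recording via pitch shifting.
    import librosa
    import soundfile as sf

    # Load an adult utterance at 16 kHz (file name is a placeholder).
    audio, sr = librosa.load("adult_utterance.wav", sr=16000)

    # Raise the pitch by 4 semitones; real augmentation pipelines typically
    # also adjust formants and speaking rate to better match child speech.
    child_like = librosa.effects.pitch_shift(audio, sr=sr, n_steps=4)

    sf.write("child_like_utterance.wav", child_like, sr)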
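
Likewise, a minimal sketch of transcribing a child utterance with a pretrained Whisper checkpoint via the Hugging Face transformers pipeline is shown below. The checkpoint name and audio file are placeholders, and this zero-shot use stands in for the thesis's actual approach of adapting such models to child speech.

    # Illustrative sketch only: zero-shot child speech transcription with a
    # pretrained Whisper model (checkpoint and file name are placeholders).
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    result = asr("child_utterance.wav")
    print(result["text"])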
Publisher
University of Galway
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International