Recent Submissions
Publication: Neuromorphic driver monitoring systems: A computationally efficient proof-of-concept for driver distraction detection (IEEE, 2023-10-18). Shariff, Waseem; Dilmaghani, Mehdi Sefidgar; Kielty, Paul; Lemley, Joe; Farooq, Muhammad Ali; Khan, Faisal; Corcoran, Peter; Irish Research Council.
Driver Monitoring Systems (DMS) represent a promising approach for enhancing driver safety within vehicular technologies. This research explores the integration of neuromorphic event camera technology into DMS, offering faster and more localized detection of changes due to motion or lighting in an imaged scene. When applied to the observation of a human subject, an event camera provides a new level of sensing capability over conventional imaging systems. The study focuses on the application of DMS by incorporating event cameras, augmented by submanifold sparse neural network (SSNN) models to reduce computational complexity. To validate the effectiveness of the proposed machine learning pipeline built on event data, driver distraction is adopted as a critical use case. The SSNN model is trained on synthetic event data generated from the publicly available Drive&Act and Driver Monitoring Dataset (DMD) using a video-to-event conversion algorithm (V2E). The proposed approach yields performance comparable with state-of-the-art approaches, achieving an accuracy of 86.25% on the Drive&Act dataset and 80% on the comprehensive DMD dataset, while significantly reducing computational complexity. In addition, to demonstrate the generalization of our results, the network is also evaluated using a locally acquired event dataset gathered from a commercially available neuromorphic event sensor.

Publication: Towards monocular neural facial depth estimation: Past, present, and future (IEEE, 2022-03-11). Khan, Faisal; Farooq, Muhammad Ali; Shariff, Waseem; Basak, Shubhajit; Corcoran, Peter; Science Foundation Ireland; College of Science and Engineering, University of Galway; Xperi Galway.
This article contains all of the information needed to conduct a study on monocular facial depth estimation problems. A brief literature review and applications of facial depth map research are offered first, followed by a comprehensive evaluation of publicly available facial depth datasets and widely used loss functions. The key properties and characteristics of each facial depth map dataset are described and evaluated. Furthermore, facial depth map loss functions are briefly discussed, which makes it easier to train neural facial depth models on a variety of datasets for both short- and long-range depth maps. The network's design and components are essential, but its effectiveness is largely determined by how it is trained, which necessitates a large dataset and a suitable loss function. Implementation details of how neural depth networks work and their corresponding evaluation metrics are presented and explained. In addition, a state-of-the-art neural model for facial depth estimation is proposed, along with a detailed comparative evaluation and, where feasible, direct comparison of facial depth estimation methods, which serves as the foundation for the proposed model. The proposed model shows better performance than current state-of-the-art methods when tested across four datasets. The new loss function used in the proposed method helps the network to learn the facial regions, resulting in accurate depth prediction. The network is trained on synthetic human facial depth datasets, whereas both real and synthetic facial images are used for validation. The results show that the trained network outperforms current state-of-the-art networks, setting a new baseline method for facial depth estimation.
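For context on the evaluation metrics referenced in the facial depth estimation article above, the sketch below computes three error measures that are standard in the monocular depth literature (absolute relative error, RMSE, and the δ < 1.25 threshold accuracy). It is a generic illustration rather than the paper's own evaluation code, and the function name is hypothetical.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard monocular depth-estimation metrics (Eigen et al. style)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > eps                              # ignore pixels with no ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)     # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))     # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                # threshold accuracy delta < 1.25
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

# Toy example on a flattened 2x2 depth map (values in metres).
print(depth_metrics([0.9, 1.1, 2.0, 3.2], [1.0, 1.0, 2.0, 3.0]))
```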
Publication: Evaluation of thermal imaging on embedded GPU platforms for application in vehicular assistance systems (IEEE, 2022-03-09). Farooq, Muhammad Ali; Shariff, Waseem; Corcoran, Peter.
This study is focused on evaluating the real-time performance of thermal object detection for smart and safe vehicular systems by deploying the trained networks on GPU and single-board edge-GPU computing platforms for onboard automotive sensor suite testing. A novel large-scale C3I Thermal Automotive dataset comprising more than 35,000 distinct frames is acquired, processed, and open-sourced, covering challenging weather and environmental scenarios. The dataset is recorded with a low-cost yet effective uncooled LWIR thermal camera, mounted stand-alone and on an electric vehicle to minimize mechanical vibrations. The state-of-the-art YOLO-v5 network variants are trained using four different public datasets as well as the newly acquired local dataset, employing the SGD optimizer for optimal DNN generalization. The effectiveness of the trained networks is validated on extensive test data using various quantitative metrics, including precision, recall, mean average precision, and frames per second. The smaller network variant of YOLO is further optimized using the TensorRT inference accelerator to explicitly boost the frame rate. The optimized network engine increases the frames-per-second rate by 3.5 times when testing on low-power edge devices, achieving 11 fps on the Nvidia Jetson Nano and 60 fps on the Nvidia Xavier NX development boards.

Publication: A text-to-speech pipeline, evaluation methodology, and initial fine-tuning results for child speech synthesis (IEEE, 2022-04-28). Jain, Rishabh; Yiwere, Mariam Yahayah; Bigioi, Dan; Corcoran, Peter; Cucu, Horia; Enterprise Ireland.
Speech synthesis has come a long way, as current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research focuses on adult speech data, and there has been very limited work on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. The approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both subjective and objective evaluations were performed, using a pretrained MOSNet for objective evaluation and a novel subjective framework for mean opinion score (MOS) evaluations. Subjective evaluations achieved MOS values of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation using a pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model is also used to provide a word error rate (WER) comparison between the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
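The child-speech TTS study above reports a word error rate (WER) comparison between real and synthetic child voices. As a minimal sketch of how such a comparison can be computed, the snippet below uses the jiwer package; the transcripts and the choice of library are illustrative assumptions, not details taken from the paper.

```python
# pip install jiwer
from jiwer import wer

# Hypothetical transcripts: ASR output for a real child recording and for the
# synthetic (TTS) rendering of the same sentence.
reference = "the quick brown fox jumps over the lazy dog"
real_hyp = "the quick brown fox jumps over the lazy dog"
synth_hyp = "the quick brown fox jump over the lazy dog"

print("WER (real child audio):", wer(reference, real_hyp))        # 0.0
print("WER (synthetic child audio):", wer(reference, synth_hyp))  # one substitution
```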
Publication: iFace 1.1: A proof-of-concept of a facial authentication based digital ID for smart cities (IEEE, 2022-07-01). Mitra, Alakananda; Bigioi, Dan; Mohanty, Saraju P.; Corcoran, Peter; Kougianos, Elias.
"Smart cities" offer a viable solution to various issues caused by accelerated urban growth. To make smart cities a reality, smart citizens need to be connected to the smart city through a digital ID. A digital ID enables citizens to utilize smart city facilities such as healthcare, transportation, finance, and energy with ease and efficiency. In this paper, we propose a proof-of-concept of a facial authentication-based end-to-end digital ID system for a smart city. Facial authentication systems are prone to various biometric template attacks and cyber security attacks. Our proposed system is designed to detect the first type of attack, especially deepfake and presentation attacks. Users are authenticated each time they use facilities in a smart city. Facial data is stored in the cloud in a lookup table format with an unidentifiable username. The process is very secure, as no data leaves the device during authentication. Our proposed solution achieved 97% accuracy in authentication, with a False Rejection Ratio of 2% and a False Acceptance Ratio of 3%.

Publication: Toward robust facial authentication for low-power edge-AI consumer devices (IEEE, 2022-11-24). Yao, Wang; Varkarakis, Viktor; Costache, Gabriel; Lemley, Joseph; Corcoran, Peter; Irish Research Council.
Robust authentication for low-power consumer devices without a keyboard remains a challenge. The recent availability of low-power neural accelerator hardware, combined with improvements in neural facial recognition algorithms, provides enabling technology for low-power, on-device facial authentication. The present research work explores a number of approaches to test the robustness of a state-of-the-art facial recognition (FR) technique, ArcFace, for such end-to-end applications. As extreme lighting conditions and facial pose are the two most challenging scenarios for FR, we focus on these. Due to the general lack of large-scale multiple-identity datasets, GAN-based re-lighting and pose techniques are used to explore the effects on FR performance. These results are further validated on the best available multi-identity datasets, MultiPIE and BIWI. The results show that FR is quite robust to pose variations up to 45-55 degrees, but the outcomes are not definitive for the tested lighting scenarios. For lighting, the tested GAN-based relighting augmentations show significant effects on FR robustness. However, the lighting scenarios from the MultiPIE dataset, the best available public dataset, show some conflicting results. It is unclear if this is due to an incorrectly learned GAN relighting transformation or, alternatively, to mixed ambient/directional lighting scenes in the dataset. However, it is shown that the GAN-induced FR errors for extreme lighting conditions can be corrected by fine-tuning the FR network layers. The conclusions support the feasibility of implementing a robust authentication method for low-power consumer devices.
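Both facial-authentication entries above report error rates such as the False Acceptance and False Rejection ratios. The sketch below shows one generic way such rates can be derived from genuine and impostor similarity scores at a fixed decision threshold; the scores, threshold, and function name are illustrative assumptions rather than the papers' evaluation code.

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """False acceptance / false rejection rates at a given similarity threshold."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    frr = np.mean(genuine < threshold)    # genuine pairs wrongly rejected
    far = np.mean(impostor >= threshold)  # impostor pairs wrongly accepted
    return far, frr

# Illustrative cosine-similarity scores between face embeddings.
genuine = [0.81, 0.74, 0.92, 0.66, 0.88]
impostor = [0.21, 0.35, 0.52, 0.18, 0.44]
print(far_frr(genuine, impostor, threshold=0.6))
```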
Publication: Pose-aware speech driven facial landmark animation pipeline for automated dubbing (IEEE, 2022-12-20). Bigioi, Dan; Jordan, Hugh; Jain, Rishabh; McDonnell, Rachel; Corcoran, Peter; Science Foundation Ireland.
A novel neural pipeline for generating pose-aware 3D animated facial landmarks synchronised to a target speech signal is proposed for the task of automatic dubbing. The goal is to automatically synchronize a target actor's lips and facial motion to an unseen speech sequence, while maintaining the quality of the original performance. Given a 3D facial key-point sequence extracted from any reference video and a target audio clip, the neural pipeline learns how to generate head-pose-aware, identity-aware landmarks and outputs accurate 3D lip motion directly at the inference stage. These generated landmarks can be used to render a photo-realistic video via an additional image-to-image conversion stage. In this paper, a novel data augmentation technique is introduced that increases the size of the training dataset from N audio/visual pairs up to NxN unique pairs for the task of automatic dubbing. The trained inference pipeline employs an LSTM-based network that takes Mel coefficients from an unseen speech sequence as input, combined with head pose and identity parameters extracted from a reference video, to generate a new set of pose-aware 3D landmarks synchronized with the unseen speech.

Publication: A robust light-weight fused-feature encoder-decoder model for monocular facial depth estimation from single images trained on synthetic data (IEEE, 2023-04-17). Corcoran, Peter; Khan, Faisal; Shariff, Waseem; Farooq, Muhammad Ali; Basak, Shubhajit; University of Galway; Xperi Galway.
Due to their real-time acquisition and the reasonable cost of consumer cameras, monocular depth maps have been employed in a variety of visual applications. Despite ongoing research in depth estimation, however, they continue to suffer from low accuracy and substantial sensor noise. To improve the prediction of depth maps, this paper proposes a lightweight neural facial depth estimation model based on single image frames. Following a basic encoder-decoder network design, features are extracted by initializing the encoder with a high-performance pre-trained network, and high-quality facial depth maps are reconstructed with a simple decoder. By employing a feature fusion module, the model can exploit pixel representations and recover full details of facial features and boundaries. When tested and evaluated across four public facial depth datasets, the proposed network provides reliable, state-of-the-art results with significantly lower computational complexity and a reduced number of parameters. The training procedure is primarily based on synthetic human facial images, which provide a consistent ground-truth depth map, and the use of an appropriate loss function leads to higher performance. Numerous experiments have been performed to validate and demonstrate the usefulness of the proposed approach. Finally, the model performs better than existing comparative facial depth networks in terms of generalization ability and robustness across different test datasets, setting a new baseline method for facial depth maps.
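The fused-feature encoder-decoder entry above describes a pre-trained encoder, a simple decoder, and a feature fusion module. The sketch below is a minimal PyTorch illustration of that general pattern; the MobileNetV2 backbone, layer split, channel sizes, and class name are assumptions made for illustration and do not reflect the paper's actual architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder with one fused skip connection (illustrative only)."""
    def __init__(self):
        super().__init__()
        backbone = models.mobilenet_v2(weights="DEFAULT").features
        self.enc_low = backbone[:4]    # early features, 1/4 resolution, 24 channels
        self.enc_high = backbone[4:]   # deep features, 1/32 resolution, 1280 channels
        self.fuse = nn.Conv2d(1280 + 24, 256, kernel_size=3, padding=1)
        self.decode = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),   # single-channel depth map
        )

    def forward(self, x):
        low = self.enc_low(x)
        high = self.enc_high(low)
        # Upsample deep features to the early-feature resolution and fuse them.
        high = nn.functional.interpolate(high, size=low.shape[2:], mode="bilinear",
                                         align_corners=False)
        fused = self.fuse(torch.cat([high, low], dim=1))
        depth = self.decode(fused)
        # Return the depth map at the input resolution.
        return nn.functional.interpolate(depth, size=x.shape[2:], mode="bilinear",
                                         align_corners=False)

model = TinyDepthNet().eval()
with torch.no_grad():
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1, 224, 224])
```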
Publication: Assessing the physiological effect of non-driving-related task performance in conditionally automated driving systems: A systematic review and meta-analysis protocol (Sage, 2023-05-08). Coyne, Rory; Ryan, Leona; Moustafa, Mohamed; Smeaton, Alan F.; Corcoran, Peter; Walsh, Jane C.; Science Foundation Ireland.
Background: Level 3 automated driving systems involve the continuous performance of the driving task by artificial intelligence within set environmental conditions, such as a straight highway. The driver's role in Level 3 is to resume responsibility for the driving task in response to any departure from these conditions. As automation increases, a driver's attention may divert towards non-driving-related tasks (NDRTs), making transitions of control between the system and user more challenging. Safety features such as physiological monitoring thus become important with increasing vehicle automation. However, to date there has been no attempt to synthesise the evidence for the effect of NDRT engagement on drivers' physiological responses in Level 3 automation. Methods: A comprehensive search of the electronic databases MEDLINE, EMBASE, Web of Science, PsycINFO, and IEEE Xplore will be conducted. Empirical studies assessing the effect of NDRT engagement on at least one physiological parameter during Level 3 automation, in comparison with a control group or baseline condition, will be included. Screening will take place in two stages, and the process will be outlined within a PRISMA flow diagram. Relevant physiological data will be extracted from studies and analysed using a series of meta-analyses by outcome. A risk of bias assessment will also be completed on the sample. Conclusion: This review will be the first to appraise the evidence for the physiological effect of NDRT engagement during Level 3 automation, and will have implications for future empirical research and the development of driver state monitoring systems.

Publication: A WAV2VEC2-based experimental study on self-supervised learning methods to improve child speech recognition (IEEE, 2023-05-10). Jain, Rishabh; Barcovschi, Andrei; Yiwere, Mariam Yahayah; Bigioi, Dan; Corcoran, Peter; Cucu, Horia; Science Foundation Ireland.
Despite recent advancements in deep learning technologies, child speech recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated data for training, which is scarce. In this work, we explore using the ASR model wav2vec2 with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training data, adult speech data, and a combination of both, to discover the optimum amount of data required to finetune the model for the task of child ASR. Our trained model achieves the best Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.91 on the PFSTAR dataset, and 12.77 on the CMU KIDS dataset using cleaned variants of each dataset. Our models outperformed the unmodified wav2vec2 BASE 960 on child speech using as little as 10 hours of child speech data in finetuning. The analysis of different types of training data and their effect on inference is provided using a combination of custom datasets in pretraining, finetuning, and inference. These "cleaned" datasets are made available so that other researchers can compare their results with ours.
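The wav2vec2 study above fine-tunes publicly available wav2vec2 models and compares against the unmodified wav2vec2 BASE 960 baseline. As a minimal sketch of running inference with that baseline, the snippet below uses the Hugging Face transformers implementation; the audio file name is hypothetical, and this is not the authors' fine-tuning code.

```python
# pip install transformers torchaudio
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Hypothetical child-speech recording; wav2vec2 expects 16 kHz mono audio.
waveform, sample_rate = torchaudio.load("child_utterance.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])   # greedy CTC transcription
```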
Publication: C3I-SynFace: A synthetic head pose and facial depth dataset using seed virtual human models (Elsevier, 2023-03-29). Basak, Shubhajit; Khan, Faisal; Javidnia, Hossein; Corcoran, Peter; McDonnell, Rachel; Schukat, Michael; Science Foundation Ireland.
This article presents C3I-SynFace: a large-scale synthetic human face dataset with corresponding ground-truth annotations of head pose and face depth, generated using the iClone 7 Character Creator "Realistic Human 100" toolkit with variations in ethnicity, gender, race, age, and clothing. The data is generated from 15 female and 15 male synthetic 3D human models extracted from the iClone software in FBX format. Five facial expressions (neutral, angry, sad, happy, and scared) are added to the face models to introduce further variation. With the help of these models, an open-source data generation pipeline in Python is proposed to import the models into the 3D computer graphics tool Blender and render the facial images along with ground-truth annotations of head pose and face depth in raw format. The dataset contains more than 100k ground-truth samples with their annotations. With the help of virtual human models, the proposed framework can generate extensive synthetic facial datasets (e.g., head pose or face depth datasets) with a high degree of control over facial and environmental variations such as pose, illumination, and background. Such large datasets can be used for the improved and targeted training of deep neural networks.

Publication: Real-time multi-task facial analytics with event cameras (IEEE, 2023-07-20). Ryan, Cian; Elrasad, Amr; Shariff, Waseem; Lemley, Joe; Kielty, Paul; Hurney, Patrick; Corcoran, Peter; Irish Research Council.
Event cameras, unlike traditional frame-based cameras, excel in detecting and reporting changes in light intensity on a per-pixel basis. This unique technology offers numerous advantages, including high temporal resolution, low latency, wide dynamic range, and reduced power consumption. These characteristics make event cameras particularly well-suited for sensing applications such as monitoring drivers or human behavior. This paper presents a feasibility study on using a multitask neural network with event cameras for real-time facial analytics. Our proposed network simultaneously estimates head pose, eye gaze, and facial occlusions. Notably, the network is trained on synthetic event camera data, and its performance is demonstrated and validated using real event data in real-time driving scenarios. To compensate for global head motion, we introduce a novel event integration method capable of handling both short- and long-term temporal dependencies. The performance of our facial analytics method is quantitatively evaluated in both controlled lab environments and unconstrained driving scenarios. The results demonstrate that useful accuracy and computational speed are achieved by the proposed method in determining head pose and relative eye-gaze direction. This shows that neuromorphic facial analytics can be realized in real time and is well-suited for edge/embedded computing deployments. While the improvement ratio in comparison to existing literature may not be as favorable due to the unique event-based vision approach employed, it is crucial to note that our research focuses specifically on event-based vision, which offers distinct advantages over traditional RGB vision. Overall, this study contributes to the emerging field of event-based vision systems and highlights the potential of multitask neural networks combined with event cameras for real-time sensing of human subjects. These techniques can be applied in practical applications such as driver monitoring systems, interactive human-computer systems, and human behavior analysis.
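The multi-task facial analytics entry above introduces an event integration method for handling temporal dependencies. As a simple point of reference, the sketch below shows a generic way of accumulating raw (timestamp, x, y, polarity) events into fixed-interval count frames; it is not the paper's integration scheme, and the data layout is an assumption.

```python
import numpy as np

def events_to_frames(events, height, width, window_us=10_000):
    """Accumulate (t, x, y, polarity) events into signed count frames per time window.
    Generic representation only, not the temporal integration scheme of the paper."""
    t, x, y, p = (np.asarray(events[k]) for k in ("t", "x", "y", "p"))
    t = t - t.min()
    n_frames = int(t.max() // window_us) + 1
    frames = np.zeros((n_frames, height, width), dtype=np.int16)
    idx = (t // window_us).astype(int)
    signs = np.where(p > 0, 1, -1)
    np.add.at(frames, (idx, y, x), signs)   # +1 for ON events, -1 for OFF events
    return frames

# Tiny synthetic example: four events on an 8x8 sensor, timestamps in microseconds.
events = {"t": [0, 2000, 12000, 15000], "x": [1, 2, 3, 3], "y": [4, 4, 5, 5], "p": [1, 0, 1, 1]}
print(events_to_frames(events, 8, 8).shape)   # (2, 8, 8)
```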
Publication: Neuromorphic driver monitoring systems: A proof-of-concept for yawn detection and seatbelt state detection using an event camera (IEEE, 2023-09-05). Kielty, Paul; Dilmaghani, Mehdi Sefidgar; Shariff, Waseem; Ryan, Cian; Lemley, Joe; Corcoran, Peter; Science Foundation Ireland.
Driver monitoring systems (DMS) are a key component of vehicular safety and essential for the transition from semi-autonomous to fully autonomous driving. Neuromorphic vision systems, based on event camera technology, provide advanced sensing in motion analysis tasks. In particular, the behaviours of drivers' eyes have been studied for the detection of drowsiness and distraction. This research explores the potential to extend neuromorphic sensing techniques to analyse the entire facial region, detecting yawning behaviours that give a complementary indicator of drowsiness. A second proof of concept for the use of event cameras to detect the fastening or unfastening of a seatbelt is also developed. Synthetic training datasets are derived from RGB and near-infrared (NIR) video from both private and public datasets using a video-to-event converter, and are used to train, validate, and test a convolutional neural network (CNN) with a self-attention module and a recurrent head for both the yawning and seatbelt tasks. For yawn detection, respective F1-scores of 95.3% and 90.4% were achieved on synthetic events from our test set and the "YawDD" dataset. For seatbelt fastness detection, 100% accuracy was achieved on unseen test sets of both synthetic and real events. These results demonstrate the feasibility of adding yawn detection and seatbelt fastness detection components to neuromorphic DMS.

Publication: A study on the effect of ageing in facial authentication and the utility of data augmentation to reduce performance bias across age groups (IEEE, 2023-09-06). Yao, Wang; Farooq, Muhammad Ali; Lemley, Joseph; Corcoran, Peter; Irish Research Council.
This work presents a study of the effects of ageing on the performance and reliability of facial authentication methods. First, a brief review of the literature on the effect of age on face recognition algorithms is presented, followed by a detailed description of the face ageing datasets. In contrast with some recent studies, we demonstrate significant variations in authentication robustness between age groups. The second part of this paper focuses on a comprehensive comparative assessment of the effects across age groups. Four different face recognition algorithms are studied, of which three are state-of-the-art neural network based models and the fourth is a conventional machine learning model. Two different age-range threshold settings of the age groups (±3 in Experiment Category A and ±5 in Experiment Category B) are adopted in the experimental analysis to enable a proper comparison. Moreover, a synthetic ageing method has been incorporated to augment the age data. Experimental results show that the older adult groups are easier to identify, with higher levels of accuracy and robustness compared to other age groups, while younger adults are the most challenging and false authentications are more likely to occur.
Publication: Augmentation techniques for adult-speech to generate child-like speech data samples at scale (IEEE, 2023-09-20). Yahayah Yiwere, Mariam; Barcovschi, Andrei; Jain, Rishabh; Cucu, Horia; Corcoran, Peter; Science Foundation Ireland.
Technologies such as text-to-speech (TTS) synthesis and automatic speech recognition (ASR) have become important in providing speech-based artificial intelligence (AI) solutions in today's AI-centric technology sector. Most current research work and solutions focus largely on adult speech rather than child speech. The main reason for this disparity is the limited availability of children's speech datasets that can be used to train modern speech AI systems. In this paper, we propose and validate a speech augmentation pipeline to transform existing adult speech datasets into synthetic child-like speech. We use a publicly available phase-vocoder-based toolbox for manipulating sound files to tune the pitch and duration of the adult speech utterances, making them sound child-like. Both objective and subjective evaluations are performed on the resulting synthetic child utterances. For the objective evaluation, the embeddings of the selected top adult speakers are compared before and after augmentation to a mean child speaker embedding. The average adult voice is shown to have a cosine similarity of approximately 0.87 (87%) relative to the mean child voice after augmentation, compared to a similarity of approximately 0.74 (74%) before augmentation. Mean opinion score (MOS) tests were also conducted for the subjective evaluation, with average MOS scores of 3.7 for how convincing the samples are as child speech and 4.6 for how intelligible the speech is. Finally, ASR models fine-tuned with the augmented speech are tested against a baseline set of ASR experiments, showing modest improvements over the baseline model fine-tuned with only adult speech.

Publication: Multilingual video dubbing—a technology review and current challenges (Frontiers Media, 2023-09-25). Bigioi, Dan; Corcoran, Peter; Science Foundation Ireland.
The proliferation of multi-lingual content on today's streaming services has created a need for automated multi-lingual dubbing tools. In this article, current state-of-the-art approaches are discussed with reference to recent works in automatic dubbing and the closely related field of talking head generation. A taxonomy of papers within both fields is presented, and the main challenges of both speech-driven automatic dubbing and talking head generation are discussed and outlined, together with proposals for future research to tackle these issues.

Publication: ChildGAN: large scale synthetic child facial data using domain adaptation in StyleGAN (IEEE, 2023-10-02). Farooq, Muhammad Ali; Yao, Wang; Costache, Gabriel; Corcoran, Peter; Irish Research Council.
In this research work, we propose ChildGAN, a novel pair of GAN networks for generating synthetic facial data of boys and girls, derived from StyleGAN2. ChildGAN is built by performing smooth domain transfer using transfer learning, and it provides photo-realistic, high-quality data samples. A large-scale dataset is rendered with a variety of smart facial transformations: facial expressions, age progression, eye-blink effects, head pose, skin and hair color variations, and variable lighting conditions. The dataset comprises more than 300k distinct data samples. Further, the uniqueness and characteristics of the rendered facial features are validated by running different computer vision application tests, including a CNN-based child gender classifier, a face localization and facial landmarks detection test, identity similarity evaluation using ArcFace, and eye detection and eye aspect ratio tests. The results demonstrate that high-quality synthetic child facial data offers an alternative to the cost and complexity of collecting a large-scale dataset from real children. The complete dataset along with the trained model is open-sourced on our GitHub page: https://github.com/MAli-Farooq/ChildGAN.
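One of the validation tests run on the ChildGAN data above is an eye aspect ratio (EAR) test. The sketch below implements the widely used EAR formulation of Soukupová and Čech (2016) over six eye landmarks; the landmark coordinates are illustrative, and the paper may compute the ratio differently.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye aspect ratio from six landmarks (p1..p6) around one eye,
    following the commonly used formulation of Soukupova & Cech (2016)."""
    p1, p2, p3, p4, p5, p6 = (np.asarray(p, float) for p in eye)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# Illustrative landmark coordinates (x, y in pixels) for an open eye.
open_eye = [(10, 20), (14, 16), (20, 16), (24, 20), (20, 24), (14, 24)]
print(round(eye_aspect_ratio(open_eye), 3))   # ~0.57; values near 0 indicate a closed eye
```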
Publication: Assessing the physiological effect of non-driving-related task performance and task modality in conditionally automated driving systems: A systematic review and meta-analysis (Elsevier, 2023-08-29). Coyne, Rory; Ryan, Leona; Moustafa, Mohamed; Smeaton, Alan F.; Corcoran, Peter; Walsh, Jane C.; Science Foundation Ireland.
In conditionally automated driving, the driver is free to disengage from controlling the vehicle, but they are expected to resume driving in response to certain situations or events that the system is not equipped to respond to. As the level of vehicle automation increases, drivers often engage in non-driving-related tasks (NDRTs), defined as any secondary task unrelated to the primary task of driving. This engagement can have a detrimental effect on the driver's situation awareness and attentional resources. NDRTs with resource demands that overlap with the driving task, such as visual or manual tasks, may be particularly deleterious. Therefore, monitoring the driver's state is an important safety feature for conditionally automated vehicles, and physiological measures constitute a promising means of doing this. The present systematic review and meta-analysis synthesises findings from 32 studies concerning the effect of NDRTs on drivers' physiological responses, in addition to the effect of NDRTs with a visual or a manual modality. Evidence was found that NDRT engagement led to higher physiological arousal, indicated by increased heart rate, increased electrodermal activity, and decreased heart rate variability. There was mixed evidence for an effect of both visual and manual NDRT modalities on all physiological measures. Understanding the relationship between task performance and arousal during automated driving is of critical importance to the development of driver monitoring systems and improving the safety of this technology.

Publication: Automatic inspection of seal integrity in sterile barrier packaging: A deep learning approach (IEEE, 2024-01-01). Diaz, Julio Zanon; Farooq, Muhammad Ali; Corcoran, Peter; Boston Scientific Manufacturing Facility, Galway.
The digitalisation of visual tasks through imaging techniques and computer vision has the potential to disrupt the manner in which advanced manufacturing processes are deployed. In this study we collaborated with the manufacturing industry to investigate the effective use of end-to-end convolutional neural networks (CNNs) to enable advanced manufacturing processes by inspecting the seal integrity of sterile barrier packaging in highly regulated products, such as medical devices. For this purpose, a novel "DS1" dataset of labelled images representative of production samples was acquired in an industrial-like environment and is open-sourced for future research work. The core focus of this research is to address the common challenges associated with performing quality inspections in advanced manufacturing environments, with the aim of detecting defects that have very high impact but very low occurrence rates, by incorporating a set of pre-trained deep learning architectures. The performance of state-of-the-art CNNs trained on small and imbalanced datasets with low image variation and low pixel complexity is validated on unseen test data. The study indicated that while CNN performance drops when datasets are imbalanced, some architectures are more resilient and capable of successfully classifying defects in small datasets of the order of a few hundred samples in which as little as 5% of the samples are defective. Furthermore, this study also discusses the marginal impact of training with basic data augmentations and the tendency of models to overfit when trained with manufacturing datasets such as "DS1".
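The seal-inspection study above trains CNNs on small, heavily imbalanced datasets in which as little as 5% of samples are defective. The sketch below shows one generic mitigation, inverse-frequency class weighting of the cross-entropy loss in PyTorch; the class counts and labels are illustrative, and this is not the paper's specific training setup.

```python
import torch
import torch.nn as nn

# Illustrative class counts: 950 "pass" samples vs 50 "defect" samples (~5% defective).
counts = torch.tensor([950.0, 50.0])
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency class weights
criterion = nn.CrossEntropyLoss(weight=weights)   # rare "defect" errors are penalised more

# Toy logits/labels for a batch of 4 images (2 classes: 0 = pass, 1 = defect).
logits = torch.randn(4, 2)
labels = torch.tensor([0, 0, 1, 0])
loss = criterion(logits, labels)
print(weights, loss.item())
```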
Publication: Speech driven video editing via an audio-conditioned diffusion model (Elsevier, 2024-01-23). Bigioi, Dan; Basak, Shubhajit; Stypułkowski, Michał; Zieba, Maciej; Jordan, Hugh; McDonnell, Rachel; Corcoran, Peter; Science Foundation Ireland.
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate auditory speech recording, the lip and jaw motions are re-synchronised without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing. All code, datasets, and models used as part of this work are made publicly available here: https://danbigioi.github.io/DiffusionVideoEditing/
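The video-editing entry above conditions a denoising diffusion model on audio mel spectral features. The sketch below shows a single training step of the standard DDPM noise-prediction objective with mel features passed as conditioning; the toy denoiser, tensor shapes, and noise schedule are assumptions made for illustration and do not reflect the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Placeholder noise predictor: concatenates noisy frame features with mel conditioning."""
    def __init__(self, frame_dim=64, mel_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + mel_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, x_t, t, mel):
        t_feat = t.float().unsqueeze(-1) / 1000.0          # crude timestep encoding
        return self.net(torch.cat([x_t, mel, t_feat], dim=-1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                      # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

model = ToyDenoiser()
x0 = torch.randn(8, 64)       # clean (flattened) frame features, batch of 8
mel = torch.randn(8, 80)      # aligned mel-spectrogram features for the same frames

# One DDPM training step: add noise at a random timestep, then predict that noise.
t = torch.randint(0, T, (8,))
noise = torch.randn_like(x0)
a = alpha_bar[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
loss = nn.functional.mse_loss(model(x_t, t, mel), noise)
loss.backward()
print(loss.item())
```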