Speech driven video editing via an audio-conditioned diffusion model
Bigioi, Dan ; Basak, Shubhajit ; Stypułkowski, Michał ; Zieba, Maciej ; Jordan, Hugh ; McDonnell, Rachel ; Corcoran, Peter
Publication Date
2024-01-23
Type
journal article
Citation
Bigioi, Dan, Basak, Shubhajit, Stypułkowski, Michał, Zieba, Maciej, Jordan, Hugh, McDonnell, Rachel, & Corcoran, Peter. (2024). Speech driven video editing via an audio-conditioned diffusion model. Image and Vision Computing, 142, 104911. https://doi.org/10.1016/j.imavis.2024.104911
Abstract
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate auditory speech recording, the lip and jaw motions are re-synchronised without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel-spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual dataset. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing. All code, datasets, and models used as part of this work are made publicly available here: https://danbigioi.github.io/DiffusionVideoEditing/
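
The abstract describes conditioning a denoising diffusion model on audio mel-spectral features so that the generated frames follow the speech. The sketch below is a minimal PyTorch illustration of that conditioning pattern only; it is not the authors' released code (linked above). The AudioConditionedDenoiser module, the MLP denoiser, the layer sizes, the toy noise schedule, and the flattened mouth-crop framing are assumptions made for brevity.

# Minimal sketch (assumptions noted above): condition a diffusion denoiser
# on pooled mel-spectrogram features and train it to predict the added noise.
import torch
import torch.nn as nn
import torchaudio

class AudioConditionedDenoiser(nn.Module):
    """Predicts the noise added to a flattened mouth-region frame,
    conditioned on a timestep embedding and audio features (illustrative)."""
    def __init__(self, frame_dim=64 * 64 * 3, audio_dim=80, hidden=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)   # project mel features
        self.time_embed = nn.Embedding(1000, hidden)     # simple timestep embedding
        self.net = nn.Sequential(
            nn.Linear(frame_dim + 2 * hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, frame_dim),                # predict noise epsilon
        )

    def forward(self, noisy_frame, t, mel_feat):
        cond = torch.cat([noisy_frame,
                          self.time_embed(t),
                          self.audio_proj(mel_feat)], dim=-1)
        return self.net(cond)

# Audio -> mel-spectrogram features (80 mel bins is an assumed, common choice).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
waveform = torch.randn(1, 16000)                  # 1 s of dummy audio
mel_feat = mel(waveform).mean(dim=-1)             # (1, 80), pooled over time

# One DDPM-style training step: noise a frame at a random timestep, regress the noise.
model = AudioConditionedDenoiser()
frame = torch.rand(1, 64 * 64 * 3)                # dummy ground-truth mouth crop
t = torch.randint(0, 1000, (1,))
noise = torch.randn_like(frame)
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2  # toy noise schedule
noisy = alpha_bar.sqrt() * frame + (1 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, mel_feat), noise)
loss.backward()

In practice a UNet-style image denoiser would replace the MLP and the audio features would be aligned per frame rather than pooled, but the core idea, injecting mel-spectrogram conditioning into the denoising network alongside the timestep, is the same.
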
Publisher
Elsevier
Publisher DOI
https://doi.org/10.1016/j.imavis.2024.104911
Rights
Attribution 4.0 International