In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync.
To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions provide a novel supervision signal for training 3D talking head avatars with accurate lip-sync.
To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop.
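To make this loop concrete, below is a minimal PyTorch-style sketch of one training step. It ignores diffusion-specific details (noise levels, denoising steps); the module names, the choice of mel-spectrogram features, and the loss weight lambda_m2s are illustrative assumptions, not the paper's exact implementation.

import torch.nn.functional as F

def training_step(avatar_model, mesh_to_speech, audio_features, gt_motion,
                  target_mel, lambda_m2s=0.1):
    # Generate facial animation from the input speech
    # (diffusion-specific details omitted for clarity).
    pred_motion = avatar_model(audio_features)

    # Standard reconstruction loss against the pseudo-GT animation.
    loss_rec = F.mse_loss(pred_motion, gt_motion)

    # Analysis-by-audio-synthesis: re-synthesize speech features from the
    # *generated* motion and compare them to the input speech.
    pred_mel = mesh_to_speech(pred_motion)
    loss_m2s = F.l1_loss(pred_mel, target_mel)

    return loss_rec + lambda_m2s * loss_m2s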
Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
Using THUNDER's Mesh-to-Speech model, we can estimate spoken audio from a 3D animation of a face.
The architecture of mesh-to-speech follows the silent-video-to-speech model of Choi et al. It consists of an input feature encoder, a Conformer sequence encoder, and two prediction heads: a speech unit classifier and a mel-spectrogram regressor. These two outputs carry enough information for the pretrained off-the-shelf vocoder of Choi et al. to turn them into output audio. Since our input is a 3D animation rather than a video, we replace the lip-video ResNet encoder used by Choi et al. with an MLP that takes 3D lip vertex coordinates as input.
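The following is a minimal PyTorch sketch of such an architecture. The vertex count, hidden sizes, number of speech units, and the use of a vanilla Transformer encoder in place of the Conformer are all assumptions for illustration, not the exact model.

import torch.nn as nn

class MeshToSpeech(nn.Module):
    """Sketch of a mesh-to-speech model as described above (assumed sizes)."""

    def __init__(self, num_lip_verts=254, d_model=256,
                 num_speech_units=200, n_mels=80):
        super().__init__()
        # Replaces the lip-video ResNet encoder: flattened (x, y, z) lip
        # vertex coordinates go through a small MLP.
        self.vertex_mlp = nn.Sequential(
            nn.Linear(num_lip_verts * 3, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        # Sequence encoder (a Conformer in the original; a vanilla
        # Transformer encoder stands in here).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.sequence_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Two prediction heads: discrete speech units and a mel spectrogram,
        # which a pretrained vocoder can turn into a waveform.
        self.unit_head = nn.Linear(d_model, num_speech_units)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, lip_verts):
        # lip_verts: (batch, frames, num_lip_verts, 3)
        b, t = lip_verts.shape[:2]
        x = self.vertex_mlp(lip_verts.reshape(b, t, -1))
        x = self.sequence_encoder(x)
        return self.unit_head(x), self.mel_head(x)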
We run the video-to-speech and mesh-to-speech models on the test subjects of the RAVDESS, GRID, and TCD-TIMIT datasets. Note that the test subjects (and their recordings) were not seen during training. The first column shows the video with the original audio (ground truth). In the second column, the audio is produced by the silent-video-to-speech method of Choi et al. (the original released model), and in the third column by Choi et al. finetuned on our dataset. The final column shows the pseudo-GT reconstruction of the video, with the audio generated by our mesh-to-speech model from the depicted facial animation.
RAVDESS is a low-vocabulary dataset of emotional speech and song recordings. Both video-to-speech and mesh-to-speech models perform well on this dataset, likely thanks to the limited vocabulary.
GRID is a much larger-scale dataset with many more subjects and a richer vocabulary than RAVDESS, but its sentences are relatively artificial. Again, both video-to-speech and mesh-to-speech models perform well on this dataset.
TCD-TIMIT has a much richer vocabulary and more natural sentences than RAVDESS and GRID, and presents a greater challenge for both video-to-speech and mesh-to-speech models. The finetuned video-to-speech model performs better than our mesh-to-speech model, but this is expected since our animations do not contain the teeth or the tongue, making the task inherently more ambiguous and difficult. Despite that, mesh-to-speech produces plausible sounds given the input animation's lip movements.
Here we ablate different input spaces for our mesh-to-speech model: all face vertices (Face2Speech), the FLAME expression vector (Exp2Speech), and mouth vertices (Mouth2Speech). While all variants are applicable for talking head avatar supervision, we opt for Mouth2Speech. See the paper for more details.
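For illustration, a minimal sketch of how the three input representations might be assembled from a FLAME fit; the clip length, expression dimensionality, and the mouth-region index list are hypothetical assumptions (FLAME meshes do have 5023 vertices).

import numpy as np

num_frames = 120                                 # assumed clip length
flame_verts = np.zeros((num_frames, 5023, 3))    # FLAME has 5023 vertices
expression = np.zeros((num_frames, 100))         # assumed expression dimensionality
mouth_idx = np.arange(0, 254)                    # hypothetical mouth-region vertex indices

face2speech_input = flame_verts                  # Face2Speech: all face vertices
exp2speech_input = expression                    # Exp2Speech: FLAME expression vector
mouth2speech_input = flame_verts[:, mouth_idx]   # Mouth2Speech: mouth vertices only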
In this section we demonstrate the utility of our mesh-to-speech model as a supervision mechanism for 3D talking head avatars.
This section shows the effect of mesh-to-speech as a supervision mechanism for talking head avatars (THUNDER). THUNDER (left) considerably improves the lip-sync of a diffusion model trained without mesh-to-speech (middle). Pseudo-GT reconstructions are shown on the right.
The following examples demonstrate that the model co-supervised with the mesh-to-speech loss achieves higher-fidelity lip animations, especially for higher-frequency motion, and does not skip syllables. The model without the mesh-to-speech loss (middle) does not reach the same level of detail and accuracy in the lip movements, resulting in a less realistic and less expressive animation.
Here we compare THUNDER to recent state-of-the-art methods for 3D talking head generation. Specifically, we use two versions of THUNDER, one with frozen mesh-to-speech (THUNDER-F) and one with trainable mesh-to-speech (THUNDER-T). We focus on methods that also utilize FLAME and are trained on pseudo-GT data. Namely, we compare THUNDER to FlameFormer (a FLAME-based adaptation of FaceFormer), EMOTE, and DiffPoseTalk.
Here we analyze the effect of different mesh-to-speech loss weights. We can observe progressively better lip-sync as we increase the weight of the mesh-to-speech loss.
Here we ablate the effect of different mesh-to-speech models. We find that Mouth2Speech produces the best lip-sync, likely thanks to its localized effect on the mouth region. See the paper for the quantitative evaluation.
Nearly all recent methods finetune the input audio encoder; otherwise they suffer from considerably impaired lip-sync. This results in a certain amount of overfitting to the training subjects' voices. We verify this by training two types of models, one with frozen Wav2Vec (THUNDER-F) and one with trainable Wav2Vec (THUNDER-T). We find that models with trainable Wav2Vec exhibit better lip-sync but produce results of lower variety, which manifests as lower diversity scores compared to models with frozen Wav2Vec (THUNDER-F). In essence, models with trainable audio encoders become less stochastic and more deterministic. Remarkably, applying the mesh-to-speech loss alleviates the need to finetune Wav2Vec (THUNDER-T): it produces significantly better lip-sync even without finetuning Wav2Vec (THUNDER-F), while dramatically reducing the number of trainable parameters. Applying mesh-to-speech together with finetuning Wav2Vec (THUNDER-T) further improves the lip-sync metrics. For the quantitative evaluation, please see the paper and its supplementary material.
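In practice, the difference between the frozen and finetuned variants is a small change in the training setup. A minimal sketch, assuming a Hugging Face Wav2Vec2 backbone (the specific checkpoint is an assumption):

from transformers import Wav2Vec2Model

audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Frozen audio encoder (as in THUNDER-F): far fewer trainable parameters,
# and the encoder cannot overfit to the training subjects' voices.
for p in audio_encoder.parameters():
    p.requires_grad = False

# For the trainable variant (as in THUNDER-T), simply leave
# requires_grad=True so the encoder is finetuned with the rest of the model.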
Here we show the diversity of results produced by THUNDER. THUNDER produces a wide variety of lip animations with correct lip-sync and varied facial expressions (such as different mouth shapes, upper-face expressions, and eye blinks), which is a result of the stochastic nature of the diffusion model. Thanks to the mesh-to-speech loss, THUNDER's lip animations remain accurate.
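This diversity comes from sampling the diffusion model with different initial noise. A sketch of how one might draw several animations for the same audio; model.sample is a hypothetical entry point, and the motion dimensionality and sequence length are assumptions.

import torch

def sample_variations(model, audio_features, num_samples=5,
                      num_frames=100, motion_dim=53):
    """Draw several animations for the same audio by varying the noise seed."""
    animations = []
    for seed in range(num_samples):
        torch.manual_seed(seed)                    # different initial noise
        noise = torch.randn(1, num_frames, motion_dim)
        animations.append(model.sample(audio_features, noise))
    return animations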
THUNDER's Mesh-to-Speech model can also be combined with other methods, again yielding superior lip animation.
Here we apply the mesh-to-speech loss to our reimplementation of Media2Face, which allows us to control the emotion of the avatar with image prompts. We select a few representative images from the RAVDESS test set (displayed below). The application of the mesh-to-speech loss results in superior lip animation in this case as well.
If you find our work useful, please consider citing it in your work:
@misc{danecek2025supervising3dtalkinghead,
  title={Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis},
  author={Radek Daněček and Carolin Schmitt and Senya Polikovsky and Michael J. Black},
  year={2025},
  eprint={2504.13386},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2504.13386},
}