THUNDER: Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis

Max Planck Institute for Intelligent Systems
Teaser Image

THUNDER introduces a new paradigm for the stochastic generation of 3D talking head avatars from speech, with accurate lip articulation and diverse facial expressions. Given an input audio, a 3D animation is generated. The animation is then fed into an audio synthesis (mesh-to-speech) model, which produces an output audio representation. The input and output audio representations are compared, creating a novel audio-consistency supervision loop, which we term analysis-by-audio-synthesis.

Abstract

In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync.

To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync.

To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop.

Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.


Elocution := the skill of clear and expressive speech, especially of distinct pronunciation and articulation.

Video

Data

We train our models on a collection of 3D facial animations and corresponding audio recordings. We use three datasets, namely RAVDESS, GRID, and TCD-TIMIT. We employ an off-the-shelf face reconstruction model from the publicly available INFERNO library. Below, we show a few reconstructions, including the original videos, the overlaid reconstructions, the standalone reconstructions, and the final unposed reconstructions (which were used in this work).

RAVDESS
GRID
TCD-TIMIT


Mesh-to-Speech

Using THUNDER's Mesh-to-Speech model, we can estimate spoken audio from a 3D animation of a face.

Architecture

The architecture of mesh-to-speech is based on the work of Choi et al., a well-known silent-video-to-speech model. It consists of an input feature encoder, a conformer sequence encoder, and two prediction heads: a speech-unit classifier and a mel-spectrogram regressor. These two outputs contain enough information for a pretrained off-the-shelf vocoder from Choi et al. to turn them into output audio. Since the input is not a video but a 3D animation, we replace the lip-video ResNet encoder of Choi et al. with an MLP that takes 3D lip vertex coordinates as input.

Mesh-To-Speech Architecture
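For intuition, the sketch below outlines this design in PyTorch-like code. The layer sizes are placeholders and a standard Transformer encoder stands in for the conformer; the actual model follows Choi et al. and is detailed in the paper.

```python
import torch
import torch.nn as nn


class MeshToSpeechSketch(nn.Module):
    """Illustrative sketch only: lip-vertex MLP encoder -> sequence encoder ->
    speech-unit classifier and mel-spectrogram regressor. All dimensions are
    placeholders, not the values used in THUNDER."""

    def __init__(self, n_lip_verts: int = 256, d_model: int = 256,
                 n_speech_units: int = 200, n_mels: int = 80):
        super().__init__()
        # MLP replacing the lip-video ResNet encoder of Choi et al.
        self.vertex_encoder = nn.Sequential(
            nn.Linear(n_lip_verts * 3, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Stand-in for the conformer sequence encoder.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.sequence_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Two prediction heads whose outputs are passed to an off-the-shelf vocoder.
        self.unit_head = nn.Linear(d_model, n_speech_units)  # speech-unit logits
        self.mel_head = nn.Linear(d_model, n_mels)            # mel-spectrogram frames

    def forward(self, lip_vertices: torch.Tensor):
        # lip_vertices: (batch, frames, n_lip_verts, 3)
        B, T = lip_vertices.shape[:2]
        x = self.vertex_encoder(lip_vertices.reshape(B, T, -1))
        x = self.sequence_encoder(x)
        return self.unit_head(x), self.mel_head(x)
```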

Results

We run the video-to-speech and mesh-to-speech models on the test subjects of the RAVDESS, GRID, and TCD-TIMIT datasets. Note that the test subjects (and their recordings) were not seen during training. The first column shows the video with its original audio (Ground Truth). In the second column, the audio is produced by the silent-video-to-speech method of Choi et al. (the original released model). In the third column, the audio is produced by Choi et al. finetuned on our dataset. The final column shows the pseudo-GT reconstruction of the video, with the audio generated by our mesh-to-speech model from the depicted facial animation.


Results on RAVDESS

RAVDESS is a low-vocabulary dataset of emotional speech and song recordings. Both video-to-speech and mesh-to-speech models perform well on this dataset, likely thanks to the limited vocabulary.

Ground Truth
Choi et al. (original)
Choi et al. (finetuned)
Mesh-to-Speech (ours)

Results on GRID

GRID is a much larger-scale dataset with many more subjects and a richer vocabulary than RAVDESS, but the sentences in the data are relatively artificial. Again, both the video-to-speech and mesh-to-speech models perform well on this dataset.

Ground Truth
Choi et al. (original)
Choi et al. (finetuned)
Mesh-to-Speech (ours)

Results on TCD-TIMIT

TCD-TIMIT has a much richer vocabulary and more natural sentences than RAVDESS or GRID, and it presents a greater challenge for both video-to-speech and mesh-to-speech models. The finetuned video-to-speech model performs better than our mesh-to-speech model, but this is expected, since our animations do not contain the teeth or the tongue, making the task inherently more ambiguous and difficult. Despite that, mesh-to-speech produces plausible sounds given the input animation's lip movements.

Ground Truth
Choi et al. (original)
Choi et al. (finetuned)
Mesh-to-Speech (ours)

Ablation of input spaces

Here we ablate different input spaces for our mesh-to-speech model. We experiment with all face vertices (Face2Speech), the FLAME expression vector (Exp2Speech), and the mouth vertices only (Mouth2Speech). While all of these models are applicable for talking head avatar supervision, we opt for Mouth2Speech. See the paper for more details.


Ground Truth
Face2Speech
Exp2Speech
Mouth2Speech (final)


Talking Head Avatars with Mesh-to-Speech

In this section we demonstrate the utility of our mesh-to-speech model as a supervision mechanism for 3D talking head avatars.

Architecture of THUNDER

THUNDER is a diffusion-based 3D speech-driven talking head avatar framework which utilizes a mesh-to-speech model as a supervision mechanism to achieve accurate lip-sync and diverse facial expressions.

THUNDER Architecture
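To make the supervision loop concrete, the following is a minimal PyTorch-style sketch of how the mesh-to-speech loss can be combined with a standard animation loss during training. The function names, tensor shapes, and the choice of MSE distances are illustrative assumptions, not the exact THUNDER objective (see the paper for the full formulation).

```python
import torch
import torch.nn.functional as F


def analysis_by_audio_synthesis_loss(pred_animation: torch.Tensor,    # generated animation
                                     gt_animation: torch.Tensor,      # pseudo-GT animation
                                     input_audio_repr: torch.Tensor,  # input speech representation
                                     mesh_to_speech: torch.nn.Module,
                                     w_m2s: float = 1.0) -> torch.Tensor:
    """Illustrative combination of a standard animation term with the
    audio-consistency (mesh-to-speech) term."""
    # Standard supervision against the pseudo-GT animation.
    loss_anim = F.mse_loss(pred_animation, gt_animation)

    # Analysis-by-audio-synthesis: re-synthesize a speech representation from
    # the generated animation and compare it to the input speech representation.
    # Gradients flow through the mesh-to-speech model back into the avatar model.
    pred_audio_repr = mesh_to_speech(pred_animation)
    loss_m2s = F.mse_loss(pred_audio_repr, input_audio_repr)

    return loss_anim + w_m2s * loss_m2s
```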

Results

This section shows the effect of mesh-to-speech as a supervision mechanism for talking head avatars (THUNDER). THUNDER (left) considerably improves the lip-sync over a diffusion model trained without mesh-to-speech (middle). Pseudo-GT reconstructions are shown on the right.


The following examples demonstrate that the model co-supervised with the mesh-to-speech loss achieves higher-fidelity lip animations, especially for higher-frequency motion. The model without the mesh-to-speech loss (middle) is not able to produce the same level of detail and accuracy in the lip movements, resulting in a less realistic and less expressive animation. In contrast, the co-supervised animation exhibits higher frequency and fidelity and does not skip syllables.


THUNDER-F
THUNDER-F
w/o mesh-to-speech
Ground Truth

Comparison to SOTA

Here we compare THUNDER to recent state-of-the-art methods for 3D talking head generation. Specifically, we use two versions of THUNDER, one with a frozen audio encoder (THUNDER-F) and one with a trainable audio encoder (THUNDER-T). We focus on methods that also utilize FLAME and are trained on pseudo-GT data. Namely, we compare THUNDER to FlameFormer (a FLAME-based adaptation of FaceFormer), EMOTE, and DiffPoseTalk.

FlameFormer-F
(our data)
EMOTE
(MEAD)
DiffPoseTalk
(TFHP)
THUNDER-F
(our data)
THUNDER-T
(our data)
Pseudo-GT

Effect of Mesh-To-Speech Loss Strength

Here we analyze the effect of different mesh-to-speech loss weights. We can observe progressively better lip-sync as we increase the weight of the mesh-to-speech loss.
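As a point of reference, and assuming the losses are combined linearly (the exact objective is given in the paper), the weight scales the audio-consistency term in the total training loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{anim}} + w_{\text{m2s}}\,\mathcal{L}_{\text{m2s}}$$

With $w_{\text{m2s}} = 0$, the model reduces to the baseline without mesh-to-speech supervision.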

wm2s = 0
wm2s = 0.5
wm2s = 1.0
(THUNDER-F)
wm2s = 5.0
wm2s = 10.0

Effect of different mesh-to-speech types

Here we ablate the effect of different mesh-to-speech models. We find that Mouth2Speech produces the best lip-sync, likely thanks to its localized effect on the mouth region. See the paper for quantitative evaluation.

No mesh-to-speech
Face2Speech
(all face vertices)
Exp2Speech
(FLAME expression)
Mouth2Speech
(mouth vertices)
Pseudo-GT

Frozen vs Trainable Audio Encoder

Nearly all recent methods finetune the input audio encoder, or else they suffer from considerably impaired lip-sync. This results in a certain amount of overfitting to the training subjects' voices. We verify this by training two types of models, one with a frozen Wav2Vec (THUNDER-F) and one with a trainable Wav2Vec (THUNDER-T). We find that models with a trainable Wav2Vec exhibit better lip-sync but produce results of lower variety, which manifests as lower diversity scores compared to models with a frozen Wav2Vec (THUNDER-F). In essence, models with trainable audio encoders become less stochastic and more deterministic. Remarkably, applying the mesh-to-speech loss alleviates the necessity of finetuning Wav2Vec: it produces significantly better lip-sync even without finetuning Wav2Vec (THUNDER-F), while dramatically reducing the number of trainable parameters. Applying mesh-to-speech along with finetuning Wav2Vec (THUNDER-T) results in further improvement of the lip-sync metrics. For the quantitative evaluation, please see the paper and its supplementary material.
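As a concrete illustration, freezing versus finetuning the audio encoder amounts to toggling gradient updates for its parameters. The snippet below assumes a Hugging Face Wav2Vec 2.0 encoder; the checkpoint and loading code are illustrative and not necessarily those used in THUNDER.

```python
from transformers import Wav2Vec2Model

# Hypothetical checkpoint for illustration; THUNDER may use a different one.
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# THUNDER-F style: keep the audio encoder frozen; only the avatar model trains.
for p in audio_encoder.parameters():
    p.requires_grad = False

# THUNDER-T style: finetune the audio encoder jointly with the avatar model.
# for p in audio_encoder.parameters():
#     p.requires_grad = True
```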

wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-F
wm2s = 5.0
wm2s = 10.0
wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-T
wm2s = 5.0
wm2s = 10.0

wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-F
wm2s = 5.0
wm2s = 10.0
wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-T
wm2s = 5.0
wm2s = 10.0

Diversity of Results

Here we show the diversity of the results produced by THUNDER. We can observe that THUNDER produces a wide variety of facial animations with correct lip-sync and diverse facial expressions (such as different mouth shapes, upper-face expressions, and eye blinks), which is a result of the stochastic nature of the diffusion model. Thanks to the mesh-to-speech loss, THUNDER's lip animations remain accurate.









Other Methods with Mesh-to-Speech

Using THUNDER's mesh-to-speech model with other methods, we still obtain superior lip animation.

FlameFormer

Here we augment the FlameFormer architecture (an adaptation of FaceFormer) with mesh-to-speech. We do so with both a frozen audio encoder (FlameFormer-F) and a trainable audio encoder (FlameFormer-T). In both cases, we observe that applying mesh-to-speech results in superior lip animation. This demonstrates that even deterministic methods with generally satisfactory lip animation can benefit from the mesh-to-speech loss.
FlameFormer-F
FlameFormer-F
with mesh-to-speech
FlameFormer-T
FlameFormer-T
with mesh-to-speech


Reimplementation of Media2Face

Here we apply the mesh-to-speech loss to our reimplementation of Media2Face, which allows us to control the emotion of the avatar with image prompts. We select a few representative images from the RAVDESS test set (displayed below). The application of the mesh-to-speech loss results in superior lip animation in this case as well.

happy
angry
fearful
disgusted
calm
sad
surprised
Media2Face with Mesh-to-Speech
Media2Face without Mesh-to-Speech


Media2Face with Mesh-to-Speech
Media2Face without Mesh-to-Speech

BibTeX

If you find our work useful, please consider citing it in your work:


@misc{danecek2025supervising3dtalkinghead,
      title={Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis}, 
      author={Radek Daněček and Carolin Schmitt and Senya Polikovsky and Michael J. Black},
      year={2025},
      eprint={2504.13386},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2504.13386}, 
}