THUNDER: Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis

Max Planck Institute for Intelligent Systems
Teaser Image

THUNDER introduces a new paradigm for the stochastic generation of 3D talking head avatars from speech, with accurate lip articulation and diverse facial expressions. Given an input audio, a 3D animation is generated. The animation is then fed into an audio synthesis (mesh-to-speech) model, which produces an output audio representation. The input and output audio representations are compared, creating a novel audio-consistency supervision loop, which we term analysis-by-audio-synthesis.

Abstract

In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync.

To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync.

To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop.

Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.


Elocution := the skill of clear and expressive speech, especially of distinct pronunciation and articulation.

Video

Data

We train our models on a collection of 3D facial animations and corresponding audio recordings. We use three datasets, namely RAVDESS, GRID, and TCD-TIMIT. We employ an off-the-shelf face reconstruction model from the publicly available INFERNO library. Below, we show a few reconstructions, including the original videos, the overlaid reconstructions, the standalone reconstructions, and the final unposed reconstructions (which were used in this work).

RAVDESS
GRID
TCD-TIMIT


Mesh-to-Speech

Using THUNDER's Mesh-to-Speech model, we can estimate spoken audio from a 3D animation of a face.

Architecture

The architecture of mesh-to-speech is based on the work of Choi et al., a well-known silent-video-to-speech model. It consists of an input feature encoder, a conformer sequence encoder, and two prediction heads: a speech-unit classifier and a mel-spectrogram regressor. These two outputs contain enough information for a pretrained off-the-shelf vocoder from Choi et al. to turn them into output audio. Since the input is not a video but a 3D animation, we replace the lip-video ResNet encoder of Choi et al. with an MLP that takes 3D lip vertex coordinates as input.

Mesh-To-Speech Architecture
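For intuition, the sketch below outlines this design in PyTorch-like code. The layer sizes are placeholders and a standard Transformer encoder stands in for the conformer; the actual model follows Choi et al. and is detailed in the paper.

```python
import torch
import torch.nn as nn


class MeshToSpeechSketch(nn.Module):
    """Illustrative sketch only: lip-vertex MLP encoder -> sequence encoder ->
    speech-unit classifier and mel-spectrogram regressor. All dimensions are
    placeholders, not the values used in THUNDER."""

    def __init__(self, n_lip_verts: int = 256, d_model: int = 256,
                 n_speech_units: int = 200, n_mels: int = 80):
        super().__init__()
        # MLP replacing the lip-video ResNet encoder of Choi et al.
        self.vertex_encoder = nn.Sequential(
            nn.Linear(n_lip_verts * 3, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Stand-in for the conformer sequence encoder.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.sequence_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Two prediction heads whose outputs are passed to an off-the-shelf vocoder.
        self.unit_head = nn.Linear(d_model, n_speech_units)  # speech-unit logits
        self.mel_head = nn.Linear(d_model, n_mels)            # mel-spectrogram frames

    def forward(self, lip_vertices: torch.Tensor):
        # lip_vertices: (batch, frames, n_lip_verts, 3)
        B, T = lip_vertices.shape[:2]
        x = self.vertex_encoder(lip_vertices.reshape(B, T, -1))
        x = self.sequence_encoder(x)
        return self.unit_head(x), self.mel_head(x)
```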

Results

We run the video-to-speech and mesh-to-speech models on the test subjects of the RAVDESS, GRID, and TCD-TIMIT datasets. Note that the test subjects (and their recordings) were not seen during training. The first column shows the video with its original audio (Ground Truth). In the second column, the audio is produced by the silent-video-to-speech method of Choi et al. (the original released model). In the third column, the audio is produced by Choi et al. finetuned on our dataset. The final column shows the pseudo-GT reconstruction of the video, with the audio generated by our mesh-to-speech model from the depicted facial animation.


Results on RAVDESS

RAVDESS is a low-vocabulary dataset of emotional speech and song recordings. Both video-to-speech and mesh-to-speech models perform well on this dataset, likely thanks to the limited vocabulary.

Ground Truth
Choi et al. (original)
Choi et al. (finetuned)
Mesh-to-Speech (ours)

Results on GRID

GRID is a much larger-scale dataset with many more subjects and a richer vocabulary than RAVDESS, but the sentences in the data are relatively artificial. Again, both the video-to-speech and mesh-to-speech models perform well on this dataset.

Ground Truth
Choi et al. (original)
Choi et al. (finetuned)
Mesh-to-Speech (ours)

Results on TCD-TIMIT

TCD-TIMIT has a much richer vocabulary and more natural sentences than RAVDESS or GRID, and it presents a greater challenge for both video-to-speech and mesh-to-speech models. The finetuned video-to-speech model performs better than our mesh-to-speech model, but this is expected, since our animations do not contain the teeth or the tongue, making the task inherently more ambiguous and difficult. Despite that, mesh-to-speech produces plausible sounds given the input animation's lip movements.

Ground Truth
Choi et al. (original)
Choi et al. (finetuned)
Mesh-to-Speech (ours)

Ablation of input spaces

Here we ablate different input spaces for our mesh-to-speech model. We experiment with all face vertices (Face2Speech), the FLAME expression vector (Exp2Speech), and the mouth vertices only (Mouth2Speech). While all of these models are applicable for talking head avatar supervision, we opt for Mouth2Speech. See the paper for more details.


Ground Truth
Face2Speech
Exp2Speech
Mouth2Speech (final)


Talking Head Avatars with Mesh-to-Speech

In this section we demonstrate the utility of our mesh-to-speech model as a supervision mechanism for 3D talking head avatars.

Architecture of THUNDER

THUNDER is a diffusion-based 3D speech-driven talking head avatar framework which utilizes a mesh-to-speech model as a supervision mechanism to achieve accurate lip-sync and diverse facial expressions.

THUNDER Architecture
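To make the supervision loop concrete, the following is a minimal PyTorch-style sketch of how the mesh-to-speech loss can be combined with a standard animation loss during training. The function names, tensor shapes, and the choice of MSE distances are illustrative assumptions, not the exact THUNDER objective (see the paper for the full formulation).

```python
import torch
import torch.nn.functional as F


def analysis_by_audio_synthesis_loss(pred_animation: torch.Tensor,    # generated animation
                                     gt_animation: torch.Tensor,      # pseudo-GT animation
                                     input_audio_repr: torch.Tensor,  # input speech representation
                                     mesh_to_speech: torch.nn.Module,
                                     w_m2s: float = 1.0) -> torch.Tensor:
    """Illustrative combination of a standard animation term with the
    audio-consistency (mesh-to-speech) term."""
    # Standard supervision against the pseudo-GT animation.
    loss_anim = F.mse_loss(pred_animation, gt_animation)

    # Analysis-by-audio-synthesis: re-synthesize a speech representation from
    # the generated animation and compare it to the input speech representation.
    # Gradients flow through the mesh-to-speech model back into the avatar model.
    pred_audio_repr = mesh_to_speech(pred_animation)
    loss_m2s = F.mse_loss(pred_audio_repr, input_audio_repr)

    return loss_anim + w_m2s * loss_m2s
```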

Results

This section shows the effect of mesh-to-speech as a supervision mechanism for talking head avatars (THUNDER). THUNDER (left) considerably improves the lip-sync over a diffusion model trained without mesh-to-speech (middle). Pseudo-GT reconstructions are shown on the right.


The following examples demonstrate that the model co-supervised with the mesh-to-speech loss achieves higher-fidelity lip animations, especially for higher-frequency motion. The model without the mesh-to-speech loss (middle) is not able to produce the same level of detail and accuracy in the lip movements, resulting in a less realistic and less expressive animation. In contrast, the co-supervised animation exhibits higher frequency and fidelity and does not skip syllables.


THUNDER-F
THUNDER-F
w/o mesh-to-speech
Ground Truth

Comparison to SOTA

Here we compare THUNDER to recent state-of-the-art methods for 3D talking head generation. Specifically, we use two versions of THUNDER, one with a frozen audio encoder (THUNDER-F) and one with a trainable audio encoder (THUNDER-T). We focus on methods that also utilize FLAME and are trained on pseudo-GT data. Namely, we compare THUNDER to FlameFormer (a FLAME-based adaptation of FaceFormer), EMOTE, and DiffPoseTalk.

FlameFormer-F
(our data)
EMOTE
(MEAD)
DiffPoseTalk
(TFHP)
THUNDER-F
(our data)
THUNDER-T
(our data)
Pseudo-GT

Effect of Mesh-To-Speech Loss Strength

Here we analyze the effect of different mesh-to-speech loss weights. We can observe progressively better lip-sync as we increase the weight of the mesh-to-speech loss.
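As a point of reference, and assuming the losses are combined linearly (the exact objective is given in the paper), the weight scales the audio-consistency term in the total training loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{anim}} + w_{\text{m2s}}\,\mathcal{L}_{\text{m2s}}$$

With $w_{\text{m2s}} = 0$, the model reduces to the baseline without mesh-to-speech supervision.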

wm2s = 0
wm2s = 0.5
wm2s = 1.0
(THUNDER-F)
wm2s = 5.0
wm2s = 10.0

Effect of different mesh-to-speech types

Here we ablate the effect of different mesh-to-speech models. We find that Mouth2Speech produces the best lip-sync, likely thanks to its localized effect on the mouth region. See the paper for quantitative evaluation.

No mesh-to-speech
Face2Speech
(all face vertices)
Exp2Speech
(FLAME expression)
Mouth2Speech
(mouth vertices)
Pseudo-GT

Frozen vs Trainable Audio Encoder

Nearly all recent methods finetune the input audio encoder, or else they suffer from considerably impaired lip-sync. This results in a certain amount of overfitting to the training subjects' voices. We verify this by training two types of models, one with a frozen Wav2Vec (THUNDER-F) and one with a trainable Wav2Vec (THUNDER-T). We find that models with a trainable Wav2Vec exhibit better lip-sync but produce results of lower variety, which manifests as lower diversity scores compared to models with a frozen Wav2Vec (THUNDER-F). In essence, models with trainable audio encoders become less stochastic and more deterministic. Remarkably, applying the mesh-to-speech loss alleviates the necessity of finetuning Wav2Vec: it produces significantly better lip-sync even without finetuning Wav2Vec (THUNDER-F), while dramatically reducing the number of trainable parameters. Applying mesh-to-speech along with finetuning Wav2Vec (THUNDER-T) results in further improvement of the lip-sync metrics. For the quantitative evaluation, please see the paper and its supplementary material.
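As a concrete illustration, freezing versus finetuning the audio encoder amounts to toggling gradient updates for its parameters. The snippet below assumes a Hugging Face Wav2Vec 2.0 encoder; the checkpoint and loading code are illustrative and not necessarily those used in THUNDER.

```python
from transformers import Wav2Vec2Model

# Hypothetical checkpoint for illustration; THUNDER may use a different one.
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# THUNDER-F style: keep the audio encoder frozen; only the avatar model trains.
for p in audio_encoder.parameters():
    p.requires_grad = False

# THUNDER-T style: finetune the audio encoder jointly with the avatar model.
# for p in audio_encoder.parameters():
#     p.requires_grad = True
```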

wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-F
wm2s = 5.0
wm2s = 10.0
wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-T
wm2s = 5.0
wm2s = 10.0

wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-F
wm2s = 5.0
wm2s = 10.0
wm2s = 0
wm2s = 0.5
wm2s = 1.0
THUNDER-T
wm2s = 5.0
wm2s = 10.0

Diversity of Results

Here we show the diversity of the results produced by THUNDER. We can observe that THUNDER produces a wide variety of facial animations with correct lip-sync and diverse facial expressions (such as different mouth shapes, upper-face expressions, and eye blinks), which is a result of the stochastic nature of the diffusion model. Thanks to the mesh-to-speech loss, THUNDER's lip animations remain accurate.









Other Methods with Mesh-to-Speech

Using THUNDER's mesh-to-speech model with other methods, we still obtain superior lip animation.

FlameFormer

Here we augment the FlameFormer architecture (an adaptation of FaceFormer) with mesh-to-speech. We do so with both a frozen audio encoder (FlameFormer-F) and a trainable audio encoder (FlameFormer-T). In both cases, we observe that applying mesh-to-speech results in superior lip animation. This demonstrates that even deterministic methods with generally satisfactory lip animation can benefit from the mesh-to-speech loss.
FlameFormer-F
FlameFormer-F
with mesh-to-speech
FlameFormer-T
FlameFormer-T
with mesh-to-speech


Reimplementation of Media2Face

Here we apply the mesh-to-speech loss to our reimplementation of Media2Face, which allows us to control the emotion of the avatar with image prompts. We select a few representative images from the RAVDESS test set (displayed below). The application of the mesh-to-speech loss results in superior lip animation in this case as well.

happy
angry
fearful
disgusted
calm
sad
surprised
Media2Face with Mesh-to-Speech
Media2Face without Mesh-to-Speech


Media2Face with Mesh-to-Speech
Media2Face without Mesh-to-Speech

BibTeX

If you find our work useful, please consider citing it in your work:


@misc{danecek2025supervising3dtalkinghead,
      title={Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis}, 
      author={Radek Daněček and Carolin Schmitt and Senya Polikovsky and Michael J. Black},
      year={2025},
      eprint={2504.13386},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2504.13386}, 
}