X-LANCE / AniTalker
```bash
conda create -n anitalker python==3.9.0
conda activate anitalker
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -r requirements.txt
```
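Before downloading checkpoints, it can help to confirm that the environment resolved to the pinned builds and that PyTorch sees the GPU. A minimal sanity-check sketch (our own, not part of the repo):

```python
# Minimal sketch: confirm the pinned versions installed and that
# PyTorch can see a CUDA device before running any demos.
import torch
import torchvision
import torchaudio

print(torch.__version__)          # expect 1.8.0
print(torchvision.__version__)    # expect 0.9.0
print(torchaudio.__version__)     # expect 0.8.0
print(torch.cuda.is_available())  # expect True on a CUDA 11.1 machine
```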
Please download the checkpoints and place them in the ckpts folder.
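A quick way to confirm the files landed where the demo commands below expect them; a minimal sketch assuming only the two checkpoint names used in those commands:

```python
# Minimal sketch: assert the checkpoints referenced by the demo
# commands below exist under ckpts/ before running inference.
from pathlib import Path

for name in ["stage1.ckpt", "stage2_pose_only.ckpt"]:
    path = Path("ckpts") / name
    if not path.is_file():
        raise FileNotFoundError(f"missing checkpoint: {path}")
print("all checkpoints in place")
```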
Keep pose_yaw, pose_pitch, and pose_roll at zero (case 1).
Demo script:

```bash
python ./code/demo.py \
    --infer_type 'mfcc_pose_only' \
    --stage1_checkpoint_path 'ckpts/stage1.ckpt' \
    --stage2_checkpoint_path 'ckpts/stage2_pose_only.ckpt' \
    --test_image_path 'test_demos/portraits/monalisa.jpg' \
    --test_audio_path 'test_demos/audios/english_female.wav' \
    --result_path 'results/monalisa_case1/' \
    --control_flag True \
    --seed 0 \
    --pose_yaw 0 \
    --pose_pitch 0 \
    --pose_roll 0
```
Changing pose_yaw from 0 to 0.25 (case 2).
Demo script:

```bash
python ./code/demo.py \
    --infer_type 'mfcc_pose_only' \
    --stage1_checkpoint_path 'ckpts/stage1.ckpt' \
    --stage2_checkpoint_path 'ckpts/stage2_pose_only.ckpt' \
    --test_image_path 'test_demos/portraits/monalisa.jpg' \
    --test_audio_path 'test_demos/audios/english_female.wav' \
    --result_path 'results/monalisa_case2/' \
    --control_flag True \
    --seed 0 \
    --pose_yaw 0.25 \
    --pose_pitch 0 \
    --pose_roll 0
```
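To compare several head orientations in one run, the same command can be scripted. A minimal sketch (our own, not part of the repo) that sweeps pose_yaw over a few values; the usable value range is an assumption inferred from the 0 and 0.25 examples above, and the output folder names are hypothetical:

```python
# Minimal sketch: invoke the demo once per yaw value.
# Flags mirror the case-1/case-2 commands above.
import subprocess

for yaw in [0.0, 0.25, 0.5]:
    subprocess.run([
        "python", "./code/demo.py",
        "--infer_type", "mfcc_pose_only",
        "--stage1_checkpoint_path", "ckpts/stage1.ckpt",
        "--stage2_checkpoint_path", "ckpts/stage2_pose_only.ckpt",
        "--test_image_path", "test_demos/portraits/monalisa.jpg",
        "--test_audio_path", "test_demos/audios/english_female.wav",
        "--result_path", f"results/monalisa_yaw{yaw}/",
        "--control_flag", "True",
        "--seed", "0",
        "--pose_yaw", str(yaw),
        "--pose_pitch", "0",
        "--pose_roll", "0",
    ], check=True)
```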
Demo script (case 3; control_flag, seed, and pose arguments are omitted, leaving head pose uncontrolled):

```bash
python ./code/demo.py \
    --infer_type 'mfcc_pose_only' \
    --stage1_checkpoint_path 'ckpts/stage1.ckpt' \
    --stage2_checkpoint_path 'ckpts/stage2_pose_only.ckpt' \
    --test_image_path 'test_demos/portraits/monalisa.jpg' \
    --test_audio_path 'test_demos/audios/english_female.wav' \
    --result_path 'results/monalisa_case3/'
```
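The uncontrolled command also batches naturally over inputs. A minimal sketch (assuming test_demos/portraits/ contains additional .jpg portraits) that animates each one with the same audio clip:

```python
# Minimal sketch: run the free-style demo for every portrait
# in the folder, writing each result to its own directory.
import subprocess
from pathlib import Path

for img in sorted(Path("test_demos/portraits").glob("*.jpg")):
    subprocess.run([
        "python", "./code/demo.py",
        "--infer_type", "mfcc_pose_only",
        "--stage1_checkpoint_path", "ckpts/stage1.ckpt",
        "--stage2_checkpoint_path", "ckpts/stage2_pose_only.ckpt",
        "--test_image_path", str(img),
        "--test_audio_path", "test_demos/audios/english_female.wav",
        "--result_path", f"results/{img.stem}_freestyle/",
    ], check=True)
```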
See MORE_SCRIPTS for additional example scripts.
```bibtex
@misc{liu2024anitalker,
  title={AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding},
  author={Tao Liu and Feilong Chen and Shuai Fan and Chenpeng Du and Qi Chen and Xie Chen and Kai Yu},
  year={2024},
  eprint={2405.03121},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
We would like to express our sincere gratitude to the numerous prior works that have laid the foundation for the development of AniTalker.
Stage 1, which primarily focuses on training the motion encoder and the rendering module, relies heavily on resources from LIA. The second-stage diffusion training is built upon diffae and espnet. For the computation of the mutual information loss, we implement methods from CLUB, and we use AAM-softmax when training the face recognition model. We also leverage the pretrained HuBERT model provided by TencentGameMate.
Additionally, we employ 3DDFA_V2 to extract head pose and torchlm to obtain face landmarks, which are used to compute face location and scale. We have open-sourced the code for these preprocessing steps at talking_face_preprocessing. We acknowledge the importance of building upon existing knowledge and are committed to contributing back to the research community by sharing our findings and code.
The code in this library is not a formal product, and we have not tested all use cases; it therefore cannot be offered directly to end customers.
The main purpose of making our code public is to facilitate academic demonstrations and communication. Any use of this code to spread harmful information is strictly prohibited.
Please use this library in compliance with the terms specified in the license file and avoid improper use.
When using the code, please abide by local laws and regulations. You bear full responsibility for your use of this code; our company (AISpeech Ltd.) is not responsible for the generated results.