“Audio2Rig: Artist-oriented Deep Learning Tool for Facial and Lip Sync Animation” by Arcelin and Chaverou
Conference:
Type(s):
Title: Audio2Rig: Artist-oriented Deep Learning Tool for Facial and Lip Sync Animation
Session/Category Title: Lips Don't Lie
Presenter(s)/Author(s): Arcelin and Chaverou
Abstract:
Creating realistic or stylized facial and lip sync animation is a tedious task. It requires a lot of time and skill to sync the lips with the audio and convey the right emotion on the character’s face. To allow animators to spend more time on the artistic and creative part of the animation, we present Audio2Rig: a new deep-learning-based tool that leverages previously animated sequences of a show to generate facial and lip sync rig animation from an audio file. Based in Maya, it learns from any production rig without adjustment and generates high-quality, stylized animation that mimics the style of the show. Audio2Rig fits into the animator’s workflow: since it generates keys on the rig controllers, the animation can easily be retaken. The method is based on three neural network modules which can learn an arbitrary number of controllers, so different configurations can be created for specific parts of the face (such as the tongue, lips or eyes). With Audio2Rig, animators can also pick different emotions and adjust their intensities to experiment with or customize the output, and they have high-level control over how keyframes are set. Our method shows excellent results, generating fine animation details while respecting the show’s style. Finally, as training relies on the studio’s own data and is done internally, it ensures data privacy and prevents copyright infringement.
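Since the abstract states that the tool runs in Maya and writes keys directly onto the production rig’s controllers, the sketch below illustrates how per-frame predictions could be applied as keyframes. It is a minimal, hypothetical example: the function name, controller names, and placeholder prediction array are assumptions for illustration, not the authors’ implementation; only the standard `maya.cmds` keyframing calls are real API.

```python
# Hypothetical sketch: writing audio-driven controller predictions back to a
# Maya rig as keyframes. Model output, controller names, and frame rate are
# illustrative assumptions, not the Audio2Rig authors' actual code.

import numpy as np
from maya import cmds  # only available inside a Maya Python session


def apply_predicted_animation(predictions, controllers, start_frame=1):
    """Set one key per frame on each rig controller attribute.

    predictions : (num_frames, num_controllers) array of values produced by
                  an audio-driven model (conditioned, e.g., on an emotion
                  label and intensity).
    controllers : list of Maya attribute names, e.g. "jaw_ctrl.translateY".
    """
    for frame_offset, frame_values in enumerate(predictions):
        frame = start_frame + frame_offset
        for attr, value in zip(controllers, frame_values):
            if cmds.objExists(attr):
                # Keys land directly on the production rig controllers, so
                # animators can retake or refine the result with their
                # usual tools.
                cmds.setKeyframe(attr, time=frame, value=float(value))


# Illustrative usage (names are placeholders).
controllers = ["jaw_ctrl.translateY", "lipCorner_L_ctrl.translateX"]
predictions = np.zeros((48, len(controllers)))  # stand-in for 2 s of output at 24 fps
apply_predicted_animation(predictions, controllers)
```

Because the output is ordinary keyframe data rather than baked vertex caches, it slots into the retake-friendly workflow the abstract describes.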