“Efficient Speech Animation Synthesis with Vocalic Lip Shapes” by Mima, Maejima and Morishima

  • ©Daisuke Mima, Akinobu Maejima, and Shigeo Morishima



Entry Number: 38


    Efficient Speech Animation Synthesis with Vocalic Lip Shapes



    Computer-generated speech animation is common in video games and movies. Skilled artists can handcraft high-quality facial motion, but time and cost constraints often make this approach impractical. Data-driven approaches [Taylor et al. 2012], for example machine learning methods that concatenate video segments of speech training data, can generate natural speech animation but often require a large number of target shapes for synthesis. Alternatively, smooth mouth motion can be obtained from lip shapes prepared for typical vowels by interpolating between them with Gaussian mixture models (GMMs) [Yano et al. 2007]. However, the resulting animation is not generated directly from the measured lip motion of a person's actual speech.
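The idea of interpolating between prepared vowel lip shapes can be illustrated with a minimal sketch. It is not the GMM formulation of Yano et al. or the authors' method: it simply blends key shapes with normalized Gaussian weights over time. The function name `blend_lip_shapes`, the vowel timings, the Gaussian widths, and the lip-shape parameter vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np

def blend_lip_shapes(times, centers, widths, shapes):
    """Blend vowel key shapes with normalized Gaussian weights over time.

    times:   (T,) sample times in seconds
    centers: (K,) time at which each vowel is centered (assumed known)
    widths:  (K,) Gaussian standard deviations in seconds (assumed)
    shapes:  (K, D) one lip-shape parameter vector per vowel (hypothetical)
    returns: (T, D) interpolated lip-shape trajectory
    """
    t = np.asarray(times, dtype=float)[:, None]      # (T, 1)
    c = np.asarray(centers, dtype=float)[None, :]    # (1, K)
    w = np.asarray(widths, dtype=float)[None, :]     # (1, K)

    # Log of the unnormalized Gaussian weight of each vowel at each time.
    log_weights = -0.5 * ((t - c) / w) ** 2          # (T, K)
    # Subtract the row maximum before exponentiating to avoid underflow.
    log_weights -= log_weights.max(axis=1, keepdims=True)
    weights = np.exp(log_weights)
    # Normalize so the weights at every time step sum to one.
    weights /= weights.sum(axis=1, keepdims=True)

    # Each output frame is a convex combination of the vowel key shapes.
    return weights @ np.asarray(shapes, dtype=float)


# Two hypothetical vowels, each described by two lip parameters
# (for example mouth width and mouth opening).
key_shapes = np.array([[0.0, 0.0],
                       [1.0, 2.0]])
trajectory = blend_lip_shapes(
    times=[0.0, 0.5, 1.0],
    centers=[0.0, 1.0],
    widths=[0.2, 0.2],
    shapes=key_shapes,
)
```

Because the weights are normalized, each frame stays within the convex hull of the key shapes, which is why this kind of interpolation yields smooth motion but cannot reproduce mouth shapes that were never prepared.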


    1. Taylor, S. L., et al. 2012. Dynamic Units of Visual Speech. In Proc. ACM SCA 2012, 275–284.
    2. Yano, A., et al. 2007. Variable Rate Speech Animation Synthesis. In Proc. ACM SIGGRAPH 2007, Poster, no. 18.
    3. Lee, A., et al. 2009. Recent Development of Open-source Speech Recognition Engine Julius. In Proc. APSIPA ASC 2009, 131–137.

