“Gesture controllers” by Levine, Krähenbühl, Thrun and Koltun

  • ©Sergey Levine, Philipp Krähenbühl, Sebastian Thrun, and Vladlen Koltun




    Gesture controllers



    We introduce gesture controllers, a method for animating the body language of avatars engaged in live spoken conversation. A gesture controller is an optimal-policy controller that schedules gesture animations in real time based on acoustic features in the user’s speech. The controller consists of an inference layer, which infers a distribution over a set of hidden states from the speech signal, and a control layer, which selects the optimal motion based on the inferred state distribution. The inference layer, consisting of a specialized conditional random field, learns the hidden structure in body language style and associates it with acoustic features in speech. The control layer uses reinforcement learning to construct an optimal policy for selecting motion clips from a distribution over the learned hidden states. The modularity of the proposed method allows customization of a character’s gesture repertoire, animation of non-human characters, and the use of additional inputs such as speech recognition or direct user control.
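The two-layer structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract's inference layer is a specialized conditional random field and its policy is learned with reinforcement learning, whereas this sketch substitutes a simple forward filter for inference and assumes a precomputed value table for control. Every name and number here (`update_belief`, `select_clip`, `transition`, `likelihood`, `q_values`, the three clips) is hypothetical and chosen only for illustration.

```python
# Hypothetical two-layer sketch; a stand-in, not the method from the paper.

def update_belief(belief, transition, likelihood):
    """Inference layer stand-in: one forward-filtering step.

    Predict with the transition model, weight each hidden state by the
    likelihood of the current acoustic frame, then renormalize.
    """
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    weighted = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(weighted)
    if total == 0.0:
        # The evidence rules out every state; fall back to uniform.
        return [1.0 / n] * n
    return [w / total for w in weighted]


def select_clip(belief, q_values):
    """Control layer stand-in: pick the clip with the highest expected value.

    q_values[s][a] is an assumed, precomputed value of playing clip a
    while the hidden state is s.
    """
    n_clips = len(q_values[0])
    expected = [sum(belief[s] * q_values[s][a] for s in range(len(belief)))
                for a in range(n_clips)]
    best = max(range(n_clips), key=lambda a: expected[a])
    return best, expected


# Illustrative numbers only: two hidden states, three clips.
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood = [0.1, 0.7]          # current frame fits state 1 better
q_values = [[1.0, 0.0, 0.2],     # state 0 favors clip 0
            [0.0, 1.0, 0.3]]     # state 1 favors clip 1

belief = update_belief([0.5, 0.5], transition, likelihood)
clip, expected = select_clip(belief, q_values)
print(clip, [round(b, 3) for b in belief])
```

Keeping the two functions separate mirrors the modularity the abstract points to: the value table can be swapped to change the gesture repertoire without touching inference, and the belief could come from any other source of state estimates.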


