Title: Real-time prosody-driven synthesis of body language
Session/Category Title: Character animation
Abstract:
Human communication involves not only speech, but also a wide variety of gestures and body motions. Interactions in virtual environments often lack this multi-modal aspect of communication. We present a method for automatically synthesizing body language animations directly from the participants’ speech signals, without the need for additional input. Our system generates appropriate body language animations by selecting segments from motion capture data of real people in conversation. The synthesis can be performed progressively, with no advance knowledge of the utterance, making the system suitable for animating characters from live human speech. The selection is driven by a hidden Markov model and uses prosody-based features extracted from speech. The training phase is fully automatic and does not require hand-labeling of input data, and the synthesis phase is efficient enough to run in real time on live microphone input. User studies confirm that our method is able to produce realistic and compelling body language.
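To make the pipeline described in the abstract concrete, the sketch below shows one way the online selection step could look: per-frame prosody features (pitch and intensity) are quantized into discrete symbols, and a hidden Markov model whose hidden states stand in for motion-capture segments is filtered one frame at a time, so a segment can be chosen with no advance knowledge of the utterance. This is only an illustrative Python sketch; the model sizes, thresholds, randomly initialized parameters, and the simple per-frame filtering rule are assumptions for demonstration, not the authors' implementation, which trains its model automatically from motion capture and speech data.

# Minimal sketch of prosody-driven motion-segment selection with a hidden
# Markov model, loosely following the pipeline in the abstract. All names,
# feature choices, thresholds, and model sizes are illustrative assumptions,
# not the authors' actual system.

import numpy as np

N_STATES = 8        # number of motion-capture segments (assumed)
N_SYMBOLS = 4       # number of quantized prosody symbols (assumed)

rng = np.random.default_rng(0)

# Placeholder HMM parameters; in practice these would be learned from
# motion-capture data aligned with recorded speech.
A = rng.dirichlet(np.ones(N_STATES), size=N_STATES)   # row-stochastic transitions
B = rng.dirichlet(np.ones(N_SYMBOLS), size=N_STATES)  # emission probabilities
pi = np.full(N_STATES, 1.0 / N_STATES)                # uniform prior over segments


def prosody_symbol(pitch_hz, intensity_db):
    """Quantize a (pitch, intensity) frame into one of N_SYMBOLS bins.
    The thresholds are arbitrary placeholders."""
    high_pitch = pitch_hz > 180.0
    loud = intensity_db > 60.0
    return int(high_pitch) * 2 + int(loud)


def step(belief, symbol):
    """One online filtering step: propagate the belief over segments through
    the transition model, then reweight by the emission likelihood."""
    belief = (A.T @ belief) * B[:, symbol]
    return belief / belief.sum()


# Simulated live input: a stream of (pitch in Hz, intensity in dB) frames.
frames = [(120.0, 55.0), (200.0, 65.0), (210.0, 70.0), (130.0, 50.0)]

belief = pi.copy()
for pitch, intensity in frames:
    belief = step(belief, prosody_symbol(pitch, intensity))
    chosen_segment = int(np.argmax(belief))  # motion segment to play next
    print(f"pitch={pitch:6.1f} Hz  intensity={intensity:4.1f} dB "
          f"-> segment {chosen_segment}")

In a real system the emission and transition models would be estimated from aligned speech and motion data during the automatic training phase, and the segment sequence would typically be decoded with a Viterbi-style pass or lookahead rather than a per-frame argmax; the sketch only illustrates why progressive, low-latency selection is possible.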