“Real-time prosody-driven synthesis of body language”

Title:

    Real-time prosody-driven synthesis of body language

Session/Category Title: Character animation


Abstract:


    Human communication involves not only speech, but also a wide variety of gestures and body motions. Interactions in virtual environments often lack this multi-modal aspect of communication. We present a method for automatically synthesizing body language animations directly from the participants’ speech signals, without the need for additional input. Our system generates appropriate body language animations by selecting segments from motion capture data of real people in conversation. The synthesis can be performed progressively, with no advance knowledge of the utterance, making the system suitable for animating characters from live human speech. The selection is driven by a hidden Markov model and uses prosody-based features extracted from speech. The training phase is fully automatic and does not require hand-labeling of input data, and the synthesis phase is efficient enough to run in real time on live microphone input. User studies confirm that our method is able to produce realistic and compelling body language.
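
    Below is a minimal illustrative sketch of the pipeline the abstract describes: prosody features extracted from incoming speech drive a hidden Markov model whose hidden states index clusters of motion-capture segments, decoded frame by frame so synthesis can keep up with live input. This is not the authors' implementation; the feature extraction, the model parameters, and all names (N_STATES, log_A, prosody_features, online_decode) are hypothetical stand-ins, and a real system would learn the HMM from motion-capture data paired with speech, as the paper describes.

# Hypothetical sketch: frame-level prosody features (pitch and intensity)
# drive a pre-trained HMM whose hidden states index motion-segment clusters.
# All parameters below are illustrative placeholders, not learned values.
import numpy as np

SR = 16000            # audio sample rate (Hz)
FRAME = 1024          # analysis window length (samples)
N_STATES = 8          # number of hidden states (motion-segment clusters)

rng = np.random.default_rng(0)

# Stand-in model parameters; in practice these would be trained on
# motion-capture data recorded together with speech.
log_pi = np.log(np.full(N_STATES, 1.0 / N_STATES))     # initial state log-probs
A = rng.dirichlet(np.ones(N_STATES), size=N_STATES)
log_A = np.log(A + 1e-12)                               # transition log-probs
means = rng.normal(size=(N_STATES, 2))                  # per-state feature means
stds = np.full((N_STATES, 2), 1.0)                      # per-state feature std-devs

def prosody_features(frame):
    """Return (log pitch estimate, log intensity) for one audio frame."""
    frame = frame - frame.mean()
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-8           # intensity proxy
    # Crude autocorrelation pitch estimate restricted to 80-400 Hz.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SR // 400, SR // 80
    lag = lo + np.argmax(ac[lo:hi])
    return np.array([np.log(SR / lag), np.log(rms)])

def log_emission(feat):
    """Diagonal-Gaussian log-likelihood of a feature vector under each state."""
    z = (feat - means) / stds
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi * stds ** 2), axis=1)

def online_decode(frames):
    """Forward decoding with no look-ahead: one motion cluster per frame."""
    log_alpha = log_pi.copy()
    states = []
    for frame in frames:
        feat = prosody_features(frame)
        log_alpha = log_emission(feat) + np.max(log_alpha[:, None] + log_A, axis=0)
        states.append(int(np.argmax(log_alpha)))
    return states

if __name__ == "__main__":
    # Stand-in for one second of live microphone input.
    audio = rng.normal(size=SR)
    frames = [audio[i:i + FRAME] for i in range(0, SR - FRAME, FRAME)]
    print(online_decode(frames))  # e.g. [3, 3, 5, ...] motion-cluster indices

    The decoding step deliberately uses no look-ahead, mirroring the abstract's point that selection proceeds progressively, with no advance knowledge of the utterance, so that characters can be animated from live microphone input.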

Submit a story:

If you would like to submit a story about this presentation, please contact us: historyarchives@siggraph.org