“Speaking with hands: creating animated conversational characters from recordings of human performance” by Stone, DeCarlo, Oh, Rodriguez, Stere, et al. …

  • ©Matthew Stone, Doug DeCarlo, Insuk Oh, Christian Rodriguez, Adrian Stere, Alyssa Lees, and Christoph (Chris) Bregler







    We describe a method for using a database of recorded speech and captured motion to create an animated conversational character. People’s utterances are composed of short, clearly-delimited phrases; in each phrase, gesture and speech go together meaningfully and synchronize at a common point of maximum emphasis. We develop tools for collecting and managing performance data that exploit this structure. The tools help create scripts for performers, help annotate and segment performance data, and structure specific messages for characters to use within application contexts. Our animations then reproduce this structure. They recombine motion samples with new speech samples to recreate coherent phrases, and blend segments of speech and motion together phrase-by-phrase into extended utterances. By framing problems for utterance generation and synthesis so that they can draw closely on a talented performance, our techniques support the rapid construction of animated characters with rich and appropriate expression.
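    The synchronization idea in the abstract, that gesture and speech in each phrase meet at a common point of maximum emphasis, can be illustrated with a small timing sketch. This is not the authors' implementation. The names `SpeechPhrase`, `MotionClip`, `align_offset`, and `schedule` are hypothetical, and the sketch assumes each sample has already been annotated with a single emphasis time. It covers only timing and leaves out the blending of motion between phrases.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeechPhrase:
    duration: float  # length of the speech sample, in seconds
    peak: float      # time of maximum emphasis, measured from phrase start


@dataclass
class MotionClip:
    duration: float  # length of the motion sample, in seconds
    peak: float      # time of the gesture's emphasis, measured from clip start


def align_offset(speech: SpeechPhrase, motion: MotionClip) -> float:
    """Motion start time, relative to speech start, that makes the peaks coincide."""
    return speech.peak - motion.peak


def schedule(pairs: List[Tuple[SpeechPhrase, MotionClip]]) -> List[Tuple[float, float]]:
    """Place phrases back to back and return (speech_start, motion_start) per phrase."""
    t = 0.0
    starts = []
    for speech, motion in pairs:
        starts.append((t, t + align_offset(speech, motion)))
        t += speech.duration  # the next phrase begins when this speech sample ends
    return starts


if __name__ == "__main__":
    utterance = [
        (SpeechPhrase(duration=2.0, peak=1.5), MotionClip(duration=1.0, peak=0.5)),
        (SpeechPhrase(duration=1.0, peak=0.25), MotionClip(duration=1.5, peak=0.75)),
    ]
    for speech_start, motion_start in schedule(utterance):
        print(f"speech starts at {speech_start:.2f}s, motion starts at {motion_start:.2f}s")
```

    A motion start earlier than its phrase's speech start means the gesture's preparation overlaps the previous phrase. In a full system that overlap is where adjacent motion segments would need to be blended.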


