Speaking with hands: creating animated conversational characters from recordings of human performance

We describe a method for using a database of recorded speech and captured motion to create an animated conversational character. People’s utterances are composed of short, clearly delimited phrases; in each phrase, gesture and speech go together meaningfully and synchronize at a common point of maximum emphasis. We develop tools for collecting and managing performance data that exploit this structure. The tools help create scripts for performers, help annotate and segment performance data, and structure specific messages for characters to use within application contexts. Our animations then reproduce this structure. They recombine motion samples with new speech samples to recreate coherent phrases, and blend segments of speech and motion together phrase-by-phrase into extended utterances. By framing problems for utterance generation and synthesis so that they can draw closely on a talented performance, our techniques support the rapid construction of animated characters with rich and appropriate expression.

Overview Page:

SIGGRAPH 2004: Technical Papers

“Speaking with hands: creating animated conversational characters from recordings of human performance” by Stone, DeCarlo, Oh, Rodriguez, Stere, et al. …