Trainable videorealistic speech animation

Tony Ezzat; Gadi Geiger; Tomaso Poggio

Conference:

SIGGRAPH 2002

Type(s):

Technical Papers

Title:

Trainable videorealistic speech animation

Presenter(s)/Author(s):

Tony Ezzat

Gadi Geiger

Tomaso Poggio

Abstract:

We describe how to create, with machine learning techniques, a generative speech animation module. A human subject is first recorded using a video camera as he/she utters a predetermined speech corpus. After processing the corpus automatically, a visual speech module is learned from the data that is capable of synthesizing the human subject’s mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence which contains natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.

The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus, and which is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.
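The abstract does not state the energy function used for trajectory synthesis, so the following is only a generic sketch of what regularization-based trajectory synthesis can look like, not the authors' formulation. It assumes a single scalar parameter per frame, hypothetical per-frame target values `targets` (for example, one value held for the duration of each aligned phoneme), hypothetical confidence weights `weights`, and a first-difference smoothness penalty scaled by `lam`. The function name `synthesize_trajectory` is likewise an illustrative choice.

```python
import numpy as np


def synthesize_trajectory(targets, weights, lam):
    """Minimize sum_t w_t (y_t - mu_t)^2 + lam * sum_t (y_{t+1} - y_t)^2.

    targets: 1-D array of per-frame target values mu_t (assumed input).
    weights: 1-D array of positive per-frame weights w_t (assumed input).
    lam:     non-negative smoothness weight (assumed hyperparameter).
    """
    targets = np.asarray(targets, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = targets.shape[0]
    # First-difference operator: (D @ y)[t] = y[t + 1] - y[t], shape (n - 1, n).
    D = np.diff(np.eye(n), axis=0)
    # Setting the gradient to zero gives (W + lam * D^T D) y = W mu,
    # where W = diag(weights).
    A = np.diag(weights) + lam * (D.T @ D)
    b = weights * targets
    return np.linalg.solve(A, b)


# Hypothetical example: three targets, each held for four frames.
mu = np.repeat([0.0, 1.0, 0.2], 4)
w = np.ones_like(mu)
y = synthesize_trajectory(mu, w, lam=2.0)
```

With `lam=0` the solution reproduces the targets exactly; increasing `lam` trades fidelity to the targets for smoother transitions between them, which is the general role a regularizer plays in this kind of synthesis.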
