“Model-based synthesis of visual speech movements from 3D video”

Abstract:


    We describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured with a 3D stereo face capture system and segmented into phonetic units. A dynamic parameterisation of the data is constructed that preserves the relationship between lip shapes and lip velocities; within this parameterisation, a model of lip motion is built and used to animate visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the stored phonetic units most similar to the target utterance during synthesis. By combining the strengths of model-based synthesis (e.g., HMMs, neural networks) with unit selection, we improve the quality of the synthesised visual speech.
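
    The candidate-selection step described above can be illustrated with a short sketch (not the authors' implementation): for each target phonetic unit, only the stored units whose audio parameters lie closest to the target are retained, and their lip-parameter trajectories are then available to drive synthesis. All names here (PhoneticUnit, audio_distance, select_candidates, k) are hypothetical, and the distance measure is a crude stand-in for whatever acoustic distance the paper actually uses.

        # Minimal Python sketch of audio-based unit selection, assuming each
        # stored unit holds an audio-parameter matrix (e.g. MFCC frames) and a
        # lip-shape parameter trajectory from the 3D capture data.
        from dataclasses import dataclass
        import numpy as np

        @dataclass
        class PhoneticUnit:
            phone: str          # phoneme label, e.g. "ae"
            audio: np.ndarray   # audio parameters, shape (T, D)
            lips: np.ndarray    # lip-shape parameter trajectory, shape (T, K)

        def audio_distance(a: np.ndarray, b: np.ndarray) -> float:
            """Frame-wise distance after resampling both units to a common length."""
            n = min(len(a), len(b))
            ia = np.linspace(0, len(a) - 1, n).astype(int)
            ib = np.linspace(0, len(b) - 1, n).astype(int)
            return float(np.mean(np.linalg.norm(a[ia] - b[ib], axis=1)))

        def select_candidates(target: PhoneticUnit,
                              corpus: list[PhoneticUnit],
                              k: int = 5) -> list[PhoneticUnit]:
            """Keep the k stored units of the same phone closest to the target audio."""
            same_phone = [u for u in corpus if u.phone == target.phone]
            same_phone.sort(key=lambda u: audio_distance(u.audio, target.audio))
            return same_phone[:k]

    In this reading, restricting synthesis to the nearest stored units is what disambiguates the otherwise many-to-many mapping from audio parameters to lip movements.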
