“High-fidelity facial and speech animation for VR HMDs” by Olszewski, Lim, Saito and Li
Session/Category Title: Scanning & Tracking People
Abstract:
Significant challenges currently prohibit expressive interaction in virtual reality (VR). Occlusions introduced by head-mounted displays (HMDs) make existing facial tracking techniques intractable, and even state-of-the-art techniques used for real-time facial tracking in unconstrained environments fail to capture subtle details of the user’s facial expressions that are essential for compelling speech animation. We introduce a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions. Using a monocular camera attached to an HMD, we record multiple subjects performing various facial expressions and speaking several phonetically-balanced sentences. These images are used with artist-generated animation data corresponding to these sequences to train a convolutional neural network (CNN) to regress images of a user’s mouth region to the parameters that control a digital avatar. To make training this system more tractable, we use audio-based alignment techniques to map images of multiple users making the same utterance to the corresponding animation parameters. We demonstrate that this approach is also feasible for tracking the expressions around the user’s eye region with an internal infrared (IR) camera, thereby enabling full facial tracking. This system requires no user-specific calibration, uses easily obtainable consumer hardware, and produces high-quality animations of speech and emotional expressions. Finally, we demonstrate the quality of our system on a variety of subjects and evaluate its performance against state-of-the-art real-time facial tracking techniques.
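To make the core idea concrete, below is a minimal, illustrative sketch of the regression step the abstract describes: a convolutional network that maps a cropped mouth-region image from the HMD-mounted camera to a vector of avatar animation parameters. This is not the authors' architecture (the paper trained its networks with Caffe); the layer sizes, input resolution, and the parameter count `NUM_PARAMS` are hypothetical placeholders chosen only to show the image-to-parameters regression pattern.

```python
# Hypothetical sketch of the approach summarized in the abstract: a CNN regresses
# a mouth-region image to the parameters controlling a digital avatar.
# Layer sizes, input resolution, and NUM_PARAMS are illustrative placeholders,
# not the architecture or rig described in the paper.
import torch
import torch.nn as nn

NUM_PARAMS = 40  # assumed number of animation (e.g. blendshape) parameters

class MouthRegressor(nn.Module):
    def __init__(self, num_params: int = NUM_PARAMS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, num_params),  # regression head: continuous parameters
        )

    def forward(self, mouth_image: torch.Tensor) -> torch.Tensor:
        # mouth_image: (batch, 1, H, W) grayscale crop from the HMD-mounted camera
        return self.regressor(self.features(mouth_image))

# Training sketch: camera frames are paired with artist-generated animation
# parameters (in the paper, frames from multiple subjects speaking the same
# sentences are mapped to the reference parameters via audio-based alignment).
model = MouthRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(images: torch.Tensor, target_params: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(images), target_params)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At runtime the same network would be evaluated per frame on the live camera crop, and the predicted parameter vector would drive the avatar rig directly; the eye-region variant described in the abstract follows the same pattern with the internal IR camera as input.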