“Egocentric videoconferencing” by Elgharib, Mendiratta, Thies, Niessner, Seidel, et al. …
Title: Egocentric videoconferencing
Session/Category Title: VR and Real-time Techniques
Abstract:
We introduce a method for egocentric videoconferencing that enables hands-free video calls, for instance by people wearing smart glasses or other mixed-reality devices. Videoconferencing conveys valuable non-verbal communication and facial expression cues, but usually requires a front-facing camera. Using a frontal camera in a hands-free setting while a person is on the move is impractical. Even holding a mobile phone camera in front of the face while sitting for a long duration is not convenient. To overcome these issues, we propose a low-cost wearable egocentric camera setup that can be integrated into smart glasses. Our goal is to mimic a classical video call, and therefore we transform the egocentric perspective of this camera into a front-facing video. To this end, we employ a conditional generative adversarial neural network that learns a transition from the highly distorted egocentric views to the frontal views common in videoconferencing. Our approach learns to transfer expression details directly from the egocentric view without using a complex intermediate parametric expression model, as is used by related face reenactment methods. We successfully handle subtle expressions that are not easily captured by parametric blendshape-based solutions, e.g., tongue movement, eye movements, eye blinking, strong expressions, and depth-varying movements. To gain control over the rigid head movements in the target view, we condition the generator on synthetic renderings of a moving neutral face. This allows us to synthesize results at different head poses. Our technique produces temporally smooth, video-realistic renderings in real time using a video-to-video translation network in conjunction with a temporal discriminator. We demonstrate the improved capabilities of our technique by comparing against related state-of-the-art approaches.
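The abstract outlines a conditional video-to-video translation setup: a generator that receives egocentric frames together with synthetic renderings of a moving neutral face (the head-pose condition) and outputs a frontal frame, trained against a temporal discriminator that judges short output clips for temporal smoothness. The PyTorch sketch below is only an illustrative approximation of that architecture under stated assumptions, not the authors' implementation; the module names (FrontalizingGenerator, TemporalDiscriminator), layer counts, and the number of conditioning frames are hypothetical choices for illustration.

```python
# Minimal sketch of a conditional video-to-video generator with a temporal
# discriminator, in the spirit of the setup described in the abstract.
# All sizes and names are assumptions; the real network is far deeper.
import torch
import torch.nn as nn

class FrontalizingGenerator(nn.Module):
    """Maps (egocentric frames + neutral-face pose renderings) -> one frontal frame."""
    def __init__(self, ego_frames=3, cond_frames=3, feat=64):
        super().__init__()
        in_ch = 3 * (ego_frames + cond_frames)  # RGB channels of all conditions stacked
        self.net = nn.Sequential(                # stand-in for a U-Net-style translator
            nn.Conv2d(in_ch, feat, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat * 2, feat, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, ego_clip, neutral_renders):
        # ego_clip, neutral_renders: (B, T, 3, H, W); stack all frames along channels
        x = torch.cat([ego_clip, neutral_renders], dim=1).flatten(1, 2)
        return self.net(x)  # (B, 3, H, W) synthesized frontal frame

class TemporalDiscriminator(nn.Module):
    """Scores short clips of frontal frames to encourage temporally smooth output."""
    def __init__(self, clip_len=3, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, feat, (clip_len, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(feat, 1, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
        )

    def forward(self, clip):
        # clip: (B, T, 3, H, W) -> (B, 3, T, H, W) for 3D convolution over time
        return self.net(clip.permute(0, 2, 1, 3, 4))  # patch-wise real/fake scores
```

In such a sketch, the generator and the temporal discriminator would be trained with a standard conditional-GAN objective plus a reconstruction loss on the frontal ground truth; conditioning on the neutral-face renderings is what exposes head pose as an explicit control, as described above.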


