“Neural monocular 3D human motion capture with physical awareness” by Shimada, Golyanik, Xu, Perez and Theobalt
Abstract:
We present a new trainable system for physically plausible markerless 3D human motion capture, which achieves state-of-the-art results in a broad range of challenging scenarios. Unlike most neural methods for human motion capture, our approach, which we dub “physionical”, is aware of physical and environmental constraints. It combines, in a fully differentiable way, several key innovations: 1) a proportional-derivative controller, with gains predicted by a neural network, that reduces delays even in the presence of fast motions, 2) an explicit rigid body dynamics model and 3) a novel optimisation layer that prevents physically implausible foot-floor penetration as a hard constraint. The inputs to our system are 2D joint keypoints, which are canonicalised in a novel way so as to reduce the dependency on intrinsic camera parameters, both at training and test time. This enables more accurate global translation estimation without generalisability loss. Our model can be fine-tuned with 2D annotations alone when 3D annotations are not available. It produces smooth and physically principled 3D motions at an interactive frame rate in a wide variety of challenging scenes, including newly recorded ones. Its advantages are especially noticeable on in-the-wild sequences that significantly differ from common 3D pose estimation benchmarks such as Human3.6M and MPI-INF-3DHP. Qualitative results are provided in the supplementary video.
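As a rough illustration of the proportional-derivative control idea mentioned in the abstract, the following minimal Python sketch tracks target joint angles with a PD law. It is not the authors' implementation, and it rests on several assumptions: the gains `kp` and `kd` are fixed constants here (in the paper they are predicted by a neural network), the dynamics are reduced to decoupled joints with a scalar inertia rather than a full rigid body model, and `clamp_above_floor` is a simple projection standing in for the paper's optimisation layer. All function names are hypothetical.

```python
def pd_torque(q_target, q, q_dot, kp, kd):
    """PD control law per joint: tau = kp * (q_target - q) - kd * q_dot."""
    return [p * (qt - qi) - d * qd
            for qt, qi, qd, p, d in zip(q_target, q, q_dot, kp, kd)]


def step(q, q_dot, tau, dt, inertia=1.0):
    """Semi-implicit Euler step for decoupled joints (assumed scalar inertia)."""
    q_dot_new = [qd + dt * t / inertia for qd, t in zip(q_dot, tau)]
    q_new = [qi + dt * qd for qi, qd in zip(q, q_dot_new)]
    return q_new, q_dot_new


def clamp_above_floor(height, floor=0.0):
    """Project a foot height onto the feasible set height >= floor.

    Stand-in for a hard non-penetration constraint; the paper uses an
    optimisation layer instead of a plain clamp.
    """
    return max(height, floor)


def track(q_target, q0, kp, kd, dt=0.01, steps=2000):
    """Drive joint angles from q0 towards q_target using PD torques."""
    q, q_dot = list(q0), [0.0] * len(q0)
    for _ in range(steps):
        tau = pd_torque(q_target, q, q_dot, kp, kd)
        q, q_dot = step(q, q_dot, tau, dt)
    return q


if __name__ == "__main__":
    # Critically damped gains for unit inertia: kd = 2 * sqrt(kp).
    final = track([0.5, -0.2], [0.0, 0.0], kp=[100.0, 100.0], kd=[20.0, 20.0])
    print(final)
    print(clamp_above_floor(-0.03))
```

Larger proportional gains make the simulated pose follow the target more tightly, which gives some intuition for why letting a network choose the gains per frame could reduce tracking delay on fast motions.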