“Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture” by Pan, Ma, Yi, Hu, Wang, et al. … – ACM SIGGRAPH HISTORY ARCHIVES

Conference:

    SIGGRAPH Asia 2023

Type(s):

    Technical Papers

Title:

    Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture

Session/Category Title:   Motion Capture and Reconstruction


Presenter(s)/Author(s):

    Pan, Ma, Yi, Hu, Wang, et al.

Abstract:


    Either RGB images or inertial signals have been used for the task of motion capture (mocap), but combining them is a new and interesting topic. We believe the combination is complementary and able to solve the inherent difficulties of using either modality alone, including occlusions, extreme lighting/texture, and out-of-view subjects for visual mocap, and global drift for inertial mocap. To this end, we propose a method that fuses monocular images and sparse IMUs for real-time human motion capture. Our method contains a dual coordinate strategy to fully exploit the IMU signals for the different goals of motion capture. Specifically, besides one branch that transforms the IMU signals into the camera coordinate system to combine with the image information, another branch learns from the IMU signals in the body root coordinate system to better estimate body poses. Furthermore, a hidden-state feedback mechanism is proposed for both branches to compensate for their respective drawbacks in extreme input cases. Our method can thus easily switch between the two kinds of signals, or combine them, to achieve robust mocap in different cases. Quantitative and qualitative results demonstrate that, by carefully designing the fusion method, our technique significantly outperforms state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation.
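The core of the dual coordinate strategy is re-expressing the same IMU orientation measurements in two different frames: the camera frame (to align with image evidence) and the body-root frame (to make pose learning viewpoint-invariant). A minimal sketch of those two transforms, assuming each measurement is a 3x3 rotation matrix in a shared world frame (the function and argument names here are illustrative, not the paper's actual implementation):

```python
import numpy as np

def to_camera_frame(R_imu_world: np.ndarray, R_world_to_cam: np.ndarray) -> np.ndarray:
    """Express a sensor orientation (given in world coordinates) in the
    camera coordinate system, so it can be fused with image features."""
    return R_world_to_cam @ R_imu_world

def to_root_frame(R_imu_world: np.ndarray, R_root_world: np.ndarray) -> np.ndarray:
    """Express a sensor orientation relative to the body root, removing the
    global heading so the pose branch sees a viewpoint-invariant signal.
    For a rotation matrix, the transpose is its inverse."""
    return R_root_world.T @ R_imu_world

# Example: a limb sensor rotated 90 degrees about the vertical axis.
R_sensor = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
R_root = R_sensor.copy()  # root shares the same heading here
# Relative to the root, this sensor orientation is the identity:
print(np.allclose(to_root_frame(R_sensor, R_root), np.eye(3)))  # True
```

Each branch then processes its own frame's signal; the hidden-state feedback lets the better-conditioned branch correct the other when one modality degrades (e.g., the subject leaves the camera view).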

