“EgoCap: egocentric marker-less motion capture with two fisheye cameras” by Rhodin, Richardt, Casas, Insafutdinov, Shafiei, et al.
Title: EgoCap: egocentric marker-less motion capture with two fisheye cameras
Session/Category Title: Human Motion
Abstract:
Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center. Marker-based systems often cause discomfort through the required marker suits, and the recording volume of both is severely restricted and often constrained to indoor scenes with controlled backgrounds. Alternative suit-based systems use several inertial measurement units or an exoskeleton to capture motion with an inside-in setup, i.e. without external sensors. This makes capture independent of a confined volume, but requires substantial, often constraining, and hard-to-set-up body instrumentation. We therefore propose a new method for real-time, marker-less, and egocentric motion capture: estimating the full-body skeleton pose from a lightweight stereo pair of fisheye cameras attached to a helmet or virtual reality headset – an optical inside-in method, so to speak. This allows full-body motion capture in general indoor and outdoor scenes, including crowded scenes with many people nearby, and enables the reconstruction of larger-scale activities. Our approach combines the strengths of a new generative pose-estimation framework for fisheye views and a ConvNet-based body-part detector trained on a large new dataset. It is particularly useful in virtual reality, where users can roam and interact freely while seeing their fully motion-captured virtual body.
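To make the hybrid approach in the abstract concrete, the sketch below shows one way such a combination can be wired up: 3D joint positions are projected into each head-mounted fisheye view and scored against per-joint ConvNet detection heatmaps. This is a minimal illustration only; the equidistant fisheye model, all function names, and the simple log-score energy are assumptions, not the paper's actual formulation (the paper calibrates its lenses with an omnidirectional camera model and optimizes an analytic pose energy over skeleton parameters).

    # Toy sketch of a generative + discriminative pose energy for two
    # head-mounted fisheye cameras. Illustrative assumptions throughout.

    import numpy as np

    def fisheye_project(X_cam, f, cx, cy):
        """Equidistant fisheye projection r = f * theta of a 3D point in
        camera coordinates (simplified stand-in for a calibrated
        omnidirectional camera model)."""
        x, y, z = X_cam
        theta = np.arctan2(np.hypot(x, y), z)  # angle from the optical axis
        phi = np.arctan2(y, x)                 # azimuth in the image plane
        r = f * theta
        return np.array([cx + r * np.cos(phi), cy + r * np.sin(phi)])

    def detection_score(heatmap, uv):
        """Look up the ConvNet body-part heatmap at the projected joint
        (nearest pixel); zero outside the image."""
        u, v = np.round(uv).astype(int)
        h, w = heatmap.shape
        if 0 <= v < h and 0 <= u < w:
            return heatmap[v, u]
        return 0.0

    def pose_energy(joints_3d, cams):
        """Negative-log-style energy: reward poses whose projected joints
        land on high heatmap responses in both fisheye views. Each camera
        is (R, t, f, cx, cy, per_joint_heatmaps)."""
        energy = 0.0
        for R, t, f, cx, cy, heatmaps in cams:
            for j, X in enumerate(joints_3d):
                uv = fisheye_project(R @ X + t, f, cx, cy)
                energy -= np.log(detection_score(heatmaps[j], uv) + 1e-6)
        return energy

In the paper, an energy of this general flavor is minimized over skeleton pose parameters rather than over unconstrained 3D joint positions, which keeps the estimate anatomically plausible; the free-joint version above is kept only for brevity.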


