“The need 4 speed in real-time dense visual tracking”
Session/Category Title: Acquiring and editing geometry via RGB(D) images
Presenter(s)/Author(s):
- Adarsh Kowdle
- Christoph Rhemann
- Sean Ryan Fanello
- Andrea Tagliasacchi
- Jonathan Taylor
- Philip L. Davidson
- Mingsong Dou
- Kaiwen Guo
- Cem Keskin
- Sameh Khamis
- David Kim
- Danhang Tang
- Vladimir Tankovich
- Julien Valentin
- Shahram Izadi
Abstract:
The advent of consumer depth cameras has incited the development of a new cohort of algorithms tackling challenging computer vision problems. The primary reason is that depth provides direct geometric information that is largely invariant to texture and illumination. As such, substantial progress has been made in human and object pose estimation, 3D reconstruction, and simultaneous localization and mapping. Most of these algorithms naturally benefit from the ability to accurately track the pose of an object or scene of interest from one frame to the next. However, commercially available depth sensors (typically running at 30 fps) leave room for large inter-frame motions that make such tracking problematic. A high frame rate depth camera would thus greatly ameliorate these issues, and further increase the tractability of these computer vision problems. Nonetheless, the depth accuracy of recent systems for high-speed depth estimation [Fanello et al. 2017b] can degrade at high frame rates. This is because the active illumination employed produces a low SNR, so a high exposure time is required to obtain a dense, accurate depth image. Furthermore, in the presence of rapid motion, longer exposure times produce motion-blur artifacts and necessitate a lower frame rate, which in turn introduces large inter-frame motions that often yield tracking failures. In contrast, this paper proposes a novel combination of hardware and software components that avoids the need to compromise between a dense, accurate depth map and a high frame rate. We document the creation of a full 3D capture system for high-speed, high-quality depth estimation, and demonstrate its advantages in a variety of tracking and reconstruction tasks. We extend the state-of-the-art active stereo algorithm presented in Fanello et al. [2017b] by adding a space-time feature in the matching phase.
We also propose a machine-learning-based depth refinement step that is an order of magnitude faster than traditional postprocessing methods. We quantitatively and qualitatively demonstrate the benefits of the proposed algorithms in the acquisition of geometry in motion. Our pipeline executes in 1.1 ms leveraging modern GPUs and off-the-shelf cameras and illumination components. We show how the sensor can be employed in many different applications, from (non-)rigid reconstructions to hand/face tracking. Further, we show many advantages over existing state-of-the-art depth camera technologies beyond frame rate, including latency, motion artifacts, multi-path errors, and multi-sensor interference.
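The space-time matching idea the abstract refers to can be illustrated with the classical spacetime-stereo formulation from the cited prior work [Davis et al. 2005; Zhang et al. 2003]: the matching cost for a candidate disparity is aggregated not only over a spatial window but also over a short stack of consecutive frames, so temporally varying active illumination disambiguates otherwise textureless matches. The sketch below is a minimal brute-force SAD version of that concept, not the paper's learned binary features or its GPU pipeline; function and variable names are our own.

```python
import numpy as np

def spacetime_sad_disparity(left, right, max_disp, win=2):
    """Per-pixel disparity via brute-force space-time SAD.

    left, right: (T, H, W) rectified grayscale stacks of T frames.
    The matching window is a (2*win+1)^2 spatial patch extended across
    all T frames, so temporal variation in active illumination adds
    discriminative texture that a single frame may lack.
    """
    T, H, W = left.shape
    best_cost = np.full((H, W), np.inf)
    best_disp = np.zeros((H, W), dtype=int)
    lp = np.pad(left, ((0, 0), (win, win), (win, win)), mode="edge")
    for d in range(max_disp + 1):
        # Shift the right stack by the candidate disparity d.
        rp = np.pad(np.roll(right, d, axis=2),
                    ((0, 0), (win, win), (win, win)), mode="edge")
        diff = np.abs(lp - rp)
        # Aggregate absolute differences over the spatial window
        # and over all T frames (the "space-time" part).
        cost = np.zeros((H, W))
        for dy in range(2 * win + 1):
            for dx in range(2 * win + 1):
                cost += diff[:, dy:dy + H, dx:dx + W].sum(axis=0)
        update = cost < best_cost
        best_cost[update] = cost[update]
        best_disp[update] = d
    return best_disp
```

With random per-frame "illumination" and a right stack shifted by a constant disparity, the stacked temporal window recovers that disparity everywhere; a real-time system would replace the exhaustive loop with learned features and a parallel solver, which is precisely the efficiency gap the paper addresses.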
References:
1. A. Bhandari, A. Kadambi, R. Whyte, C. Barsi, M. Feigin, A. A. Dorrington, and R. Raskar. 2014. Resolving Multi-path Interference in Time-of-Flight Imaging via Modulation Frequency Diversity and Sparse Regularization. CoRR (2014).
2. Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 187–194.
3. M. Bleyer, C. Rhemann, and C. Rother. 2011. PatchMatch Stereo – Stereo Matching with Slanted Support Windows. In BMVC.
4. Michael Bleyer, Christoph Rhemann, and Carsten Rother. 2012. Extracting 3D Scene-consistent Object Proposals and Depth from Stereo Images. In ECCV.
5. D. Alex Butler, Shahram Izadi, Otmar Hilliges, David Molyneaux, Steve Hodges, and David Kim. 2012. Shake’N’Sense: Reducing Interference for Overlapping Structured Light Depth Cameras. In CHI.
6. Yang Chen and Gérard Medioni. 1992. Object modelling by registration of multiple range images. Image and Vision Computing 10, 3 (1992), 145–155.
7. C. Ciliberto, S. R. Fanello, L. Natale, and G. Metta. 2012. A heteroscedastic approach to independent motion detection for actuated visual sensors. In IROS.
8. L. Arthur D’Asaro, Jean-Francois Seurin, and James D. Wynn. 2016. The VCSEL Advantage: Increased Power, Efficiency Bring New Applications. (2016).
9. James Davis, Diego Nehab, Ravi Ramamoorthi, and Szymon Rusinkiewicz. 2005. Space-time Stereo: A Unifying Framework for Depth from Triangulation. PAMI (2005).
10. Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. 2017. Motion2Fusion: Real-time Volumetric Performance Capture. ACM Trans. on Graphics (Proc. SIGGRAPH Asia) (2017).
11. Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. 2016. Fusion4D: Real-time Performance Capture of Challenging Scenes. (2016).
12. S. R. Fanello, I. Gori, G. Metta, and F. Odone. 2013a. Keep it simple and sparse: Real-time action recognition. JMLR (2013).
13. Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, and Francesca Odone. 2013b. One-Shot Learning for Real-Time Action Recognition. In IbPRIA.
14. Sean Ryan Fanello, Ugo Pattacini, Ilaria Gori, Vadim Tikhanoff, Marco Randazzo, Alessandro Roncone, Francesca Odone, and Giorgio Metta. 2014. 3D Stereo Estimation and Fully Automated Learning of Eye-Hand Coordination in Humanoid Robots. In IEEE-RAS International Conference on Humanoid Robots.
15. Sean Ryan Fanello, Christoph Rhemann, Vladimir Tankovich, A. Kowdle, S. Orts Escolano, D. Kim, and S. Izadi. 2016. HyperDepth: Learning Depth from Structured Light without Matching. CVPR (2016).
16. Sean Ryan Fanello, Julien Valentin, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, Carlo Ciliberto, Philip Davidson, and Shahram Izadi. 2017a. Low Compute and Fully Parallel Computer Vision with HashMatch. Proc. of ICCV (2017).
17. Sean Ryan Fanello, Julien Valentin, Christoph Rhemann, Adarsh Kowdle, Vladimir Tankovich, and Shahram Izadi. 2017b. UltraStereo: Efficient Learning-based Matching for Active Stereo Systems. CVPR (2017).
18. Christian Forster, Matia Pizzoli, and Davide Scaramuzza. 2014. SVO: Fast semi-direct monocular visual odometry. In ICRA.
19. D. Freedman, E. Krupka, Y. Smolin, I. Leichter, and M. Schmidt. 2014. SRA: Fast Removal of General Multipath for ToF Sensors. ECCV (2014).
20. Jason Geng. 2011. Structured-light 3D surface imaging: a tutorial. Advances in Optics and Photonics 3, 2 (2011), 128–160.
21. Yuanzheng Gong and Song Zhang. 2010. Ultrafast 3-D shape measurement with an off-the-shelf DLP projector. Optics Express (2010).
22. I. Gori, U. Pattacini, V. Tikhanoff, and G. Metta. 2013. Ranking the Good Points: A Comprehensive Method for Humanoid Robots to Grasp Unknown Objects. In IEEE ICAR.
23. Kaiwen Guo, Jonathan Taylor, Sean Fanello, Andrea Tagliasacchi, Mingsong Dou, Philip Davidson, Adarsh Kowdle, and Shahram Izadi. 2018. TwinFusion: High Framerate Non-Rigid Fusion through Fast Correspondence Tracking. In 3DV.
24. Mohit Gupta, Shree K. Nayar, Matthias B. Hullin, and Jaime Martin. 2015. Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM TOG (2015).
25. Ankur Handa. 2013. Analysing high frame-rate camera tracking. Ph.D. Dissertation. Imperial College London.
26. Ankur Handa, Richard A. Newcombe, Adrien Angeli, and Andrew J. Davison. 2012. Real-Time Camera Tracking: When is High Frame-rate Best? In ECCV.
27. Roland Höfling, Petra Aswendt, Frank Leischnig, and Matthias Förster. 2015. Characteristics of digital micromirror projection for 3D shape measurement at extreme speed. In SPIE OPTO. International Society for Optics and Photonics.
28. Berthold K. P. Horn and Brian G. Schunck. 1981. Determining optical flow. Artificial Intelligence 17, 1–3 (1981), 185–203.
29. Asmaa Hosni, Christoph Rhemann, Michael Bleyer, Carsten Rother, and Margrit Gelautz. 2013. Fast Cost-Volume Filtering for Visual Correspondence and Beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35, 2 (2013), 504–511.
30. Jae-Sang Hyun, Beiwen Li, and Song Zhang. 2017. High-speed high-accuracy three-dimensional shape measurement using digital binary defocusing method versus sinusoidal method. Optical Engineering (2017).
31. Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. 2016. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proc. of ECCV. 362–379.
32. S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. 2011. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera.
33. D. Jimenez, D. Pizarro, M. Mazo, and S. Palazuelos. 2012. Modelling and correction of multipath interference in time of flight cameras. In CVPR.
34. L. Keselman, J. Iselin Woodfill, A. Grunnet-Jepsen, and A. Bhowmik. 2017. Intel RealSense Stereoscopic Depth Cameras. CVPR Workshops (2017).
35. C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun. 2012. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV.
36. Sameh Khamis, Sean Ryan Fanello, Christoph Rhemann, Julien Valentin, Adarsh Kowdle, and Shahram Izadi. 2018. StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction. ECCV (2018).
37. Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. 2017. Need for Speed: A Benchmark for Higher Frame Rate Object Tracking. In The IEEE International Conference on Computer Vision (ICCV).
38. Hanme Kim, Stefan Leutenegger, and Andrew J. Davison. 2016. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera.
39. Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nießner. 2015. Database-Assisted Object Retrieval for Real-Time 3D Reconstruction. In Computer Graphics Forum, Vol. 34. Wiley Online Library.
40. Jiajun Lu, Hrvoje Benko, and Andrew D. Wilson. 2017. Hybrid HFR Depth: Fusing Commodity Depth and Color Cameras to Achieve High Frame Rate, Low Latency Depth Camera Interactions. In CHI.
41. Holger Moench, Mark Carpaij, Philipp Gerlach, Stephan Gronenborn, Ralph Gudde, Jochen Hellmig, Johanna Kolb, and Alexander van der Lee. 2016. VCSEL-based sensors for distance and velocity. In Proc. International Society for Optics and Photonics.
42. N. Naik, A. Kadambi, C. Rhemann, S. Izadi, R. Raskar, and S. B. Kang. 2015. A Light Transport Model for Mitigating Multipath Interference in TOF Sensors. CVPR (2015).
43. Yoshihiro Nakabo, Masatoshi Ishikawa, Haruyoshi Toyoda, and Seiichiro Mizuno. 2000. 1ms Column Parallel Vision System and Its Application of High Speed Target Tracking. In ICRA.
44. Richard Newcombe. 2012. Dense visual SLAM. Ph.D. Dissertation. Imperial College London, UK.
45. Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. KinectFusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 127–136.
46. Kohei Okumura, Hiromasa Oku, and Masatoshi Ishikawa. 2011. High-speed gaze controller for millisecond-order pan/tilt camera. In Proc. of ICRA. 6186–6191.
47. Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L. Davidson, Sameh Khamis, Mingsong Dou, et al. 2016. Holoportation: Virtual 3D Teleportation in Real-time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 741–754.
48. OSHA. 2017. OSHA Technical Manual (OTM). United States Occupational Safety and Health Administration, Office of Science and Technology Assessment.
49. Matthew O’Toole, Felix Heide, Lei Xiao, Matthias B. Hullin, Wolfgang Heidrich, and Kiriakos N. Kutulakos. 2014. Temporal frequency probing for 5D transient analysis of global light transport. ACM TOG (2014).
50. Matthew K. X. J. Pan and Günter Niemeyer. 2017. Catching a real ball in virtual reality. In Virtual Reality (VR), 2017 IEEE. IEEE, 269–270.
51. V. Pradeep, C. Rhemann, S. Izadi, C. Zach, M. Bleyer, and S. Bathiche. 2013. MonoFusion: Real-time 3D Reconstruction of Small Scenes with a Single Web Camera.
52. H. Rebecq, T. Horstschaefer, G. Gallego, and D. Scaramuzza. 2017. EVO: A Geometric Approach to Event-Based 6-DOF Parallel Tracking and Mapping in Real Time. IEEE Robotics and Automation Letters (2017).
53. Christian Reinbacher, Gottfried Munda, and Thomas Pock. 2017. Real-Time Panoramic Tracking for Event Cameras. arXiv preprint arXiv:1703.05161 (2017).
54. RoadToVR. 2016. Analysis of Valve’s ‘Lighthouse’ Tracking System Reveals Accuracy. http://www.roadtovr.com/analysis-of-valves-lighthouse-tracking-system-reveals-accuracy/. (2016).
55. Jannick P. Rolland, Richard L. Holloway, and Henry Fuchs. 1995. Comparison of optical and video see-through, head-mounted displays. (1995).
56. Joaquim Salvi, Sergio Fernandez, Tomislav Pribanic, and Xavier Llado. 2010. A state of the art in structured light patterns for surface profilometry. Pattern Recognition 43, 8 (2010), 2666–2680.
57. D. Scharstein and R. Szeliski. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. IJCV 47, 1–3 (April 2002).
58. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. 2011. Real-time Human Pose Recognition in Parts from Single Depth Images. In CVPR.
59. J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. 2013. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In Proc. of CVPR.
60. Srinath Sridhar, Franziska Mueller, Antti Oulasvirta, and Christian Theobalt. 2015. Fast and Robust Hand Tracking Using Detection-Guided Optimization. In Proceedings of Computer Vision and Pattern Recognition (CVPR). http://handtracker.mpi-inf.mpg.de/projects/FastHandTracker/
61. Jan Stühmer, Sebastian Nowozin, Andrew Fitzgibbon, Richard Szeliski, Travis Perry, Sunil Acharya, Daniel Cremers, and Jamie Shotton. 2015. Model-Based Tracking at 300Hz Using Raw Time-of-Flight Observations. In ICCV.
62. Robert W. Sumner, Johannes Schmid, and Mark Pauly. 2007. Embedded deformation for shape manipulation. ACM TOG 26, 3 (2007), 80.
63. A. Takagi, S. Yamazaki, and H. Fuchs. 2000. Development of a stereo video see-through HMD for AR systems. (2000).
64. J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, T. Sharp, E. Soto, D. Sweeney, J. Valentin, B. Luff, A. Topalian, E. Wood, S. Khamis, P. Kohli, S. Izadi, R. Banks, A. Fitzgibbon, and J. Shotton. 2016. Efficient and Precise Interactive Hand Tracking Through Joint, Continuous Optimization of Pose and Correspondences. SIGGRAPH (2016).
65. Jonathan Taylor, Vladimir Tankovich, Danhang Tang, Cem Keskin, David Kim, Philip Davidson, Adarsh Kowdle, and Shahram Izadi. 2017. Articulated Distance Fields for Ultra-Fast Tracking of Hands Interacting. ACM Trans. on Graphics (Proc. SIGGRAPH Asia) (2017).
66. Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. FaceVR: Real-Time Facial Reenactment and Eye Gaze Control in Virtual Reality. arXiv preprint arXiv:1610.03151 (2016).
67. Anastasia Tkach, Mark Pauly, and Andrea Tagliasacchi. 2016. Sphere-Meshes for Real-Time Hand Modeling and Tracking. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (2016).
68. R. Y. Tsai and R. K. Lenz. 1988. Real time versatile robotics hand/eye calibration using 3D machine vision. In ICRA.
69. Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, David Kim, Jamie Shotton, Pushmeet Kohli, Matthias Nießner, Antonio Criminisi, Shahram Izadi, and Philip Torr. 2015. SemanticPaint: Interactive 3D labeling and learning at your fingertips. ACM Transactions on Graphics (TOG) (2015).
70. Shenlong Wang, Sean Ryan Fanello, Christoph Rhemann, Shahram Izadi, and Pushmeet Kohli. 2016. The Global Patch Collider. CVPR (2016).
71. D. Webster and O. Celik. 2014. Experimental evaluation of Microsoft Kinect’s accuracy and capture rate for stroke rehabilitation applications. In Haptics Symposium (HAPTICS), 2014 IEEE. IEEE, 455–460.
72. Li Zhang, Brian Curless, and Steven M. Seitz. 2003. Spacetime Stereo: Shape Recovery for Dynamic Scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 367–374.
73. Song Zhang, Daniel Van Der Weide, and James Oliver. 2010. Superfast phase-shifting method for 3-D shape measurement. Optics Express (2010).
74. Yinda Zhang, Sameh Khamis, Christoph Rhemann, Julien Valentin, Adarsh Kowdle, Vladimir Tankovich, Michael Schoenberg, Shahram Izadi, Thomas Funkhouser, and Sean Fanello. 2018. ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems. ECCV (2018).
75. Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rhemann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, and Marc Stamminger. 2014. Real-time Non-rigid Reconstruction using an RGB-D Camera. ACM Transactions on Graphics (TOG) 33, 4 (2014).
76. Chao Zuo, Qian Chen, Guohua Gu, Shijie Feng, Fangxiaoyu Feng, Rubin Li, and Guochen Shen. 2013. High-speed three-dimensional shape measurement for dynamic scenes using bi-frequency tripolar pulse-width-modulation fringe projection. Optics and Lasers in Engineering (2013).

