Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences

Fully articulated hand tracking promises to enable fundamentally new interactions with virtual and augmented worlds, but the limited accuracy and efficiency of current systems has prevented widespread adoption. Today’s dominant paradigm uses machine learning for initialization and recovery followed by iterative model-fitting optimization to achieve a detailed pose fit. We follow this paradigm, but make several changes to the model-fitting, namely using: (1) a more discriminative objective function; (2) a smooth-surface model that provides gradients for non-linear optimization; and (3) joint optimization over both the model pose and the correspondences between observed data points and the model surface. While each of these changes may actually increase the cost per fitting iteration, we find a compensating decrease in the number of iterations. Further, the wide basin of convergence means that fewer starting points are needed for successful model fitting. Our system runs in real-time on CPU only, which frees up the commonly over-burdened GPU for experience designers. The hand tracker is efficient enough to run on low-power devices such as tablets. We can track up to several meters from the camera to provide a large working volume for interaction, even using the noisy data from current-generation depth cameras. Quantitative assessments on standard datasets show that the new approach exceeds the state of the art in accuracy. Qualitative results take the form of live recordings of a range of interactive experiences enabled by this new approach.

References:

1. 3Gear Systems Inc, 2013. Gesture recognizer. http://threegear.com, Jan.Google Scholar
2. Athitsos, V., and Sclaroff, S. 2003. Estimating 3D hand pose from a cluttered image. In Proc. CVPR, vol. 2, II–432.Google Scholar
3. Ballan, L., Taneja, A., Gall, J., Gool, L. V., and Pollefeys, M. 2012. Motion capture of hands in action using discriminative salient points. In Proc. ECCV, 640–653. Google ScholarDigital Library
4. Bray, M., Koller-Meier, E., and Van Gool, L. 2004. Smart particle filtering for 3D hand tracking. In Proc. Automatic Face and Gesture Recognition, 675–680. Google ScholarDigital Library
5. de La Gorce, M., Fleet, D. J., and Paragios, N. 2011. Model-Based 3D Hand Pose Estimation from Monocular Video. IEEE Trans. PAMI 33, 9, 1793–1805. Google ScholarDigital Library
6. Dipietro, L., Sabatini, A. M., and Dario, P. 2008. A survey of glove-based systems and their applications. IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews 38, 4, 461–482. Google ScholarDigital Library
7. Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., and Twombly, X. 2007. Vision-based hand pose estimation: A review. CVIU 108, 1-2, 52–73. Google ScholarDigital Library
8. Fleishman, S., Kliger, M., Lerner, A., and Kutliroff, G. 2015. ICPIK: Inverse kinematics based articulated-ICP. In Proc. CVPR Workshops, 28–35.Google Scholar
9. Geman, S., and McClure, D. E. 1987. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute 52, 4, 5–21.Google Scholar
10. Guzmán-Rivera, A., Kohli, P., Glocker, B., Shotton, J., Sharp, T., Fitzgibbon, A. W., and Izadi, S. 2014. Multi-output learning for camera relocalization. In Proc. CVPR, 1114–1121. Google ScholarDigital Library
11. Heap, T., and Hogg, D. 1996. Towards 3D hand tracking using a deformable model. In Proc. Automatic Face and Gesture Recognition, 140–145. Google ScholarDigital Library
12. Intel Corporation, 2016. RealSense SDK. http://software.intel.com/realsense, Jan.Google Scholar
13. Jacobson, A., Deng, Z., Kavan, L., and Lewis, J. 2014. Skinning: Real-time shape deformation. In ACM SIGGRAPH 2014 Courses, #24. Google ScholarDigital Library
14. Keskin, C., Kiraç, F., Kara, Y. E., and Akarun, L. 2012. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In Proc. ECCV, 852–863. Google ScholarDigital Library
15. Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., and Fitzgibbon, A. 2015. Learning an efficient model of hand shape variation from depth images. In Proc. CVPR, 2540–2548.Google Scholar
16. Kim, D., Hilliges, O., Izadi, S., Butler, A. D., Chen, J., Oikonomidis, I., and Olivier, P. 2012. Digits: freehand 3D interactions anywhere using a wrist-worn gloveless sensor. In Proc. UIST, 167–176. Google ScholarDigital Library
17. Krupka, E., Bar Hillel, A., Klein, B., Vinnikov, A., Freedman, D., and Stachniak, S. 2014. Discriminative ferns ensemble for hand pose recognition. In Proc. CVPR, 3670–3677. Google ScholarDigital Library
18. Leap Motion Inc, 2013. Motion Controller. http://leapmotion.com/product, Jan.Google Scholar
19. Leap Motion Inc, 2015. Orion. http://developer.leapmotion.com/orion, Feb.Google Scholar
20. Li, P., Ling, H., Li, X., and Liao, C. 2015. 3D hand pose estimation using randomized decision forest with segmentation index points. In Proc. ICCV, 819–827. Google ScholarDigital Library
21. Loop, C. T. 1987. Smooth Subdivision Surfaces Based on Triangles. Master’s thesis, University of Utah.Google Scholar
22. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and Black, M. J. 2015. SMPL: a skinned multi-person linear model. ACM Trans. Graphics 34, 6, #248. Google ScholarDigital Library
23. Makris, A., Kyriazis, N., and Argyros, A. 2015. Hierarchical particle filtering for 3D hand tracking. In Proc. CVPR Workshops, 8–17.Google Scholar
24. Melax, S., Keselman, L., and Orsten, S. 2013. Dynamics based 3D skeletal hand tracking. In Proceedings of the 2013 Graphics Interface Conference, 63–70. Google ScholarDigital Library
25. Mitchell, D. P. 1991. Spectrally optimal sampling for distribution ray tracing. In Proc. SIGGRAPH, 157–164. Google ScholarDigital Library
26. Monnai, Y., Hasegawa, K., Fujiwara, M., Yoshino, K., Inoue, S., and Shinoda, H. 2014. HaptoMime: Mid-air haptic interaction with a floating virtual screen. In Proc. UIST, 663–667. Google ScholarDigital Library
27. Neverova, N., Wolf, C., Nebout, F., and Taylor, G. 2015. Hand pose estimation through weakly-supervised learning of a rich intermediate representation. arXiv preprint 1511.06728.Google Scholar
28. Oberweger, M., Wohlhart, P., and Lepetit, V. 2015. Training a feedback loop for hand pose estimation. In Proc. ICCV, 3316–3324. Google ScholarDigital Library
29. Oikonomidis, I., Kyriazis, N., and Argyros, A. 2011. Efficient model-based 3D tracking of hand articulations using Kinect. In Proc. BMVC, 101.1–101.11.Google Scholar
30. Poier, G., Roditakis, K., Schulter, S., Michel, D., Bischof, H., and Argyros, A. A. 2015. Hybrid one-shot 3D hand pose estimation by exploiting uncertainties. In Proc. BMVC, 182.1–182.14.Google Scholar
31. Qian, C., Sun, X., Wei, Y., Tang, X., and Sun, J. 2014. Realtime and robust hand tracking from depth. In Proc. CVPR, 1106–1113. Google ScholarDigital Library
32. Rehg, J. M., and Kanade, T. 1994. Visual tracking of high DOF articulated structures: an application to human hand tracking. In Proc. ECCV, 35–46. Google ScholarDigital Library
33. Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., and Izadi, S. 2015. Accurate, robust, and flexible realtime hand tracking. In Proc. CHI, 3633–3642. Google ScholarDigital Library
34. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. 2011. Real-time human pose recognition in parts from a single depth image. In Proc. CVPR, 1297–1304. Google ScholarDigital Library
35. Shotton, J., Sharp, T., Kohli, P., Nowozin, S., Winn, J., and Criminisi, A. 2013. Decision jungles: Compact and rich models for classification. In NIPS, 234–242.Google Scholar
36. Sridhar, S., Oulasvirta, A., and Theobalt, C. 2013. Interactive markerless articulated hand motion tracking using RGB and depth data. In Proc. ICCV, 2456–2463. Google ScholarDigital Library
37. Sridhar, S., Rhodin, H., Seidel, H.-P., Oulasvirta, A., and Theobalt, C. 2014. Real-time hand tracking using a sum of anisotropic Gaussians model. In Proc. 3DV, 319–326. Google ScholarDigital Library
38. Sridhar, S., Mueller, F., Oulasvirta, A., and Theobalt, C. 2015. Fast and robust hand tracking using detection-guided optimization. In Proc. CVPR, 3213–3221.Google Scholar
39. Stenger, B., Mendonça, P. R., and Cipolla, R. 2001. Model-based 3D tracking of an articulated hand. In Proc. CVPR, vol. 2, II–310.Google Scholar
40. Sun, X., Wei, Y., Liang, S., Tang, X., and Sun, J. 2015. Cascaded hand pose regression. In Proc. CVPR, 824–832.Google Scholar
41. Tagliasacchi, A., Schröder, M., Tkach, A., Bouaziz, S., Botsch, M., and Pauly, M. 2015. Robust articulated-ICP for real-time hand tracking. Computer Graphics Forum 34, 5, 101–114.Google ScholarCross Ref
42. Tan, D. J., Cashman, T., Taylor, J., Fitzgibbon, A., Tarlow, D., Khamis, S., Izadi, S., and Shotton, J. 2016. Fits like a glove: Rapid and reliable hand shape personalization. In Proc. CVPR.Google Scholar
43. Tang, D., Yu, T.-H., and Kim, T.-K. 2013. Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In Proc. ICCV, 3224–3231. Google ScholarDigital Library
44. Tang, D., Taylor, J., Kohli, P., Keskin, C., Kim, T.-K., and Shotton, J. 2015. Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In Proc. ICCV, 3325–3333. Google ScholarDigital Library
45. Taylor, J., Shotton, J., Sharp, T., and Fitzgibbon, A. 2012. The Vitruvian Manifold: Inferring dense correspondences for one-shot human pose estimation. In Proc. CVPR, 103–110. Google ScholarDigital Library
46. Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., Izadi, S., Hertzmann, A., and Fitzgibbon, A. 2014. User-specific hand modeling from monocular depth sequences. In Proc. CVPR, 644–651. Google ScholarDigital Library
47. Tejani, A., Tang, D., Kouskouridas, R., and Kim, T.-K. 2014. Latent-class Hough forests for 3D object detection and pose estimation. In Proc. ECCV, 462–477.Google Scholar
48. Tompson, J., Stein, M., Lecun, Y., and Perlin, K. 2014. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graphics 33, 5, #169. Google ScholarDigital Library
49. Triggs, W., McLauchlan, P., Hartley, R., and Fitzgibbon, A. 2000. Bundle adjustment — A modern synthesis. In Vision Algorithms: Theory and Practice, LNCS. 298–372. Google ScholarDigital Library
50. Tzionas, D., Ballan, L., Srikantha, A., Aponte, P., Pollefeys, M., and Gall, J. 2015. Capturing hands in action using discriminative salient points and physics simulation. arXiv preprint 1506.02178. Google ScholarDigital Library
51. Ultrahaptics Ltd, 2013. Haptics System. http://ultrahaptics.com, Jan. Valentin, J., Dai, A., Niessner, M., Kohli, P., Torr, P., Izadi, S., and Keskin, C. 2016. Learning to navigate the energy landscape. arXiv preprint 1603.05772.Google Scholar
52. Vicente, S., and Agapito, L. 2013. Balloon shapes: reconstructing and deforming objects with volume from images. In Proc. 3DV, 223–230. Google ScholarDigital Library
53. Wang, R. Y., and Popović, J. 2009. Real-time hand-tracking with a color glove. ACM Trans. Graphics 28, 3, #63. Google ScholarDigital Library
54. Wang, R., Paris, S., and Popović, J. 2011. 6D hands. In Proc. UIST, 549–558. Google ScholarDigital Library
55. Wang, Y., Min, J., Zhang, J., Liu, Y., Xu, F., Dai, Q., and Chai, J. 2013. Video-based hand manipulation capture through composite motion control. ACM Trans. Graphics 32, 4 (July), 43:1–43:14. Google ScholarDigital Library
56. Wu, Y., and Huang, T. S. 2000. View-independent recognition of hand postures. In Proc. CVPR, vol. 2, 88–94.Google Scholar
57. Wu, Y., Lin, J. Y., and Huang, T. S. 2001. Capturing natural hand articulation. In Proc. ICCV, vol. 2, 426–432.Google Scholar
58. Xu, C., and Cheng, L. 2013. Efficient hand pose estimation from a single depth image. In Proc. ICCV, 3456–3462. Google ScholarDigital Library
59. Zach, C. 2014. Robust bundle adjustment revisited. In Proc. ECCV, 772–787.Google ScholarCross Ref
60. Zhao, W., Chai, J., and Xu, Y.-Q. 2012. Combining marker-based mocap and RGB-D camera for acquiring high-fidelity hand motion data. In Proc. Symposium on Computer Animation, 33–42. Google ScholarDigital Library

ACM Digital Library Publication:

Overview Page:

SIGGRAPH 2016: Technical Papers

“Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences”

Conference:

Type(s):

Title:

Session/Category Title: USER INTERFACES

Presenter(s)/Author(s):

Moderator(s):

Abstract:

References:

ACM Digital Library Publication:

Overview Page:

Sponsored by: