Learning to be a depth camera for close-range human capture and interaction

We present a machine learning technique for estimating absolute, per-pixel depth using any conventional monocular 2D camera, with minor hardware modifications. Our approach targets close-range human capture and interaction where dense 3D estimation of hands and faces is desired. We use hybrid classification-regression forests to learn how to map from near infrared intensity images to absolute, metric depth in real time. We demonstrate a variety of human-computer interaction and capture scenarios. Experiments show an accuracy that outperforms a conventional light fall-off baseline, and is comparable to high-quality consumer depth cameras, but with a dramatically reduced cost, power consumption, and form factor.
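
For readers unfamiliar with the light fall-off baseline named in the abstract, the idea can be sketched as follows. This is only an illustration under an assumed ideal inverse-square model, I = k / d^2, with a single calibration constant k and no handling of surface albedo or orientation. It is not the paper's implementation, and the function names (calibrate_k, falloff_depth, depth_map) are hypothetical.

```python
import math


def calibrate_k(intensity, known_depth):
    """Fit the constant k of the assumed model I = k / d**2
    from one reference pixel observed at a known depth."""
    return intensity * known_depth ** 2


def falloff_depth(intensity, k):
    """Invert the assumed inverse-square model: d = sqrt(k / I).

    Returns None for non-positive intensities, where the model
    gives no usable estimate.
    """
    if intensity <= 0:
        return None
    return math.sqrt(k / intensity)


def depth_map(image, k):
    """Apply the per-pixel estimate to a 2D list of intensities."""
    return [[falloff_depth(value, k) for value in row] for row in image]


# Hypothetical calibration: a pixel of intensity 0.25 seen at 0.4 m.
k = calibrate_k(0.25, 0.4)
depths = depth_map([[0.25, 1.0], [0.0, 0.0625]], k)
```

Under this model brighter pixels are read as closer, which is why a purely physical baseline of this kind is sensitive to differences in skin reflectance and surface angle, and why a learned mapping from intensity to depth is a natural point of comparison.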

SIGGRAPH 2014: Technical Papers

“Learning to be a depth camera for close-range human capture and interaction” by Fanello, Keskin, Izadi, Kohli, Shotton, et al. …

Session/Category Title: Computational Sensing & Display
