3D attention-driven depth acquisition for object identification

We address the problem of autonomously exploring unknown objects in a scene through consecutive depth acquisitions. The goal is to reconstruct the scene while identifying its objects online from among a large collection of 3D shapes. Fine-grained shape identification demands a meticulous series of observations attending to varying views and parts of the object of interest. Inspired by the recent success of attention-based models for 2D recognition, we develop a 3D Attention Model that selects the best views to scan from, as well as the most informative regions in each view to focus on, to achieve efficient object recognition. The region-level attention leads to focus-driven features that are robust against object occlusion. The attention model, trained with the 3D shape collection, encodes the temporal dependencies among consecutive views with deep recurrent networks. This facilitates order-aware view planning that accounts for robot movement cost. To achieve instance-level identification, the shape collection is organized into a hierarchy associated with pre-trained hierarchical classifiers. The effectiveness of our method is demonstrated on an autonomous robot (PR) that explores a scene and identifies the objects to construct a 3D scene model.
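The idea of view planning that accounts for movement cost can be illustrated with a small toy example. The sketch below is not the paper's learned recurrent attention model; it is a hand-written greedy rule, assuming candidate views lie on a circle around the object and that each view comes with a predicted information gain. The names `next_view`, `angular_distance`, and `move_weight`, and all the numbers, are hypothetical.

```python
import math


def angular_distance(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def next_view(current_angle, candidates, move_weight=0.5):
    """Pick the candidate view with the best gain-minus-travel score.

    candidates: list of (angle_in_radians, expected_gain) pairs. The gains
    are hypothetical stand-ins for what a trained model might predict.
    move_weight: assumed trade-off between information gain and travel.
    """
    def score(candidate):
        angle, gain = candidate
        return gain - move_weight * angular_distance(current_angle, angle)

    return max(candidates, key=score)


# Three made-up candidate views around the object.
candidates = [(0.5, 0.6), (math.pi, 0.9), (5.5, 0.7)]
best = next_view(0.0, candidates)
```

With the default weight the nearby view at 0.5 rad is chosen even though the opposite view has the highest gain; setting `move_weight=0.0` ignores travel and selects the view at π instead.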

SIGGRAPH Asia 2016: Technical Papers

“3D attention-driven depth acquisition for object identification” by Xu, Shi, Zheng, Zhang, Liu, et al. …
