“Joint embeddings of shapes and images via CNN image purification” by Li, Su, Qi, Fish, Cohen-Or, et al. …
Session/Category Title: Shapes and Images
Abstract:
Both 3D models and 2D images contain a wealth of information about everyday objects in our environment. However, it is difficult to semantically link these two media forms, even when they feature identical or very similar objects. We propose a joint embedding space populated by both 3D shapes and 2D images of objects, where the distances between embedded entities reflect the similarity between the underlying objects. This joint embedding space facilitates comparison between entities of either form and allows for cross-modality retrieval. We construct the embedding space using a 3D shape similarity measure, as 3D shapes are purer and more complete than their appearance in images, leading to more robust distance metrics. We then employ a Convolutional Neural Network (CNN) to “purify” images by muting distracting factors. The CNN is trained to map an image to a point in the embedding space that is close to the point attributed to a 3D model of an object similar to the one depicted in the image. This purifying capability of the CNN is achieved with the help of a large amount of training data consisting of images synthesized from 3D shapes. Our joint embedding allows cross-view image retrieval, image-based shape retrieval, as well as shape-based image retrieval. We evaluate our method on these retrieval tasks, show that it consistently outperforms state-of-the-art methods, and demonstrate the usability of a joint embedding in a number of additional applications.
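The two-stage idea in the abstract can be illustrated with a minimal sketch: first embed shapes so that Euclidean distances approximate a given pairwise shape-distance matrix (here via classical MDS, a simplified stand-in for the paper's embedding construction), then perform cross-modality retrieval by nearest-neighbor search against a query point, which in the full method would be produced by the "purifying" CNN mapping an image into the space. The distance matrix, function names, and query below are hypothetical toy data, not the paper's actual pipeline.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed items so that Euclidean distances approximate the
    pairwise distance matrix D (classical multidimensional scaling).
    A simplified stand-in for the paper's shape-embedding step."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                     # eigendecomposition (ascending)
    idx = np.argsort(w)[::-1][:dim]              # keep the top `dim` eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

def retrieve(query_point, shape_points, k=1):
    """Cross-modality retrieval: indices of the k embedded shapes
    nearest to a query point (e.g. a CNN-predicted image embedding)."""
    d = np.linalg.norm(shape_points - query_point, axis=1)
    return np.argsort(d)[:k]

# Toy distance matrix over 4 shapes: two similar pairs (0,1) and (2,3).
D = np.array([[0., 1., 5., 5.],
              [1., 0., 5., 5.],
              [5., 5., 0., 1.],
              [5., 5., 1., 0.]])
pts = classical_mds(D, dim=2)

# An image "purified" to land near shape 0 retrieves shapes 0 and 1 first.
hits = retrieve(pts[0] + 0.1, pts, k=2)
```

In the full method, `retrieve` would be fed the output of a CNN trained (on images rendered from the 3D shapes) to regress each image to the embedding point of its corresponding shape; the embedding itself is held fixed while the CNN learns to suppress image-specific nuisance factors.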


