3D Wikipedia: using online text to automatically label and navigate reconstructed geometry

We introduce an approach for analyzing Wikipedia and other text, together with online photos, to produce annotated 3D models of famous tourist sites. The approach is completely automated, and leverages online text and photo co-occurrences via Google Image Search. It enables a number of new interactions, which we demonstrate in a new 3D visualization tool. Text can be selected to move the camera to the corresponding objects, 3D bounding boxes provide anchors back to the text describing them, and the overall narrative of the text provides a temporal guide for automatically flying through the scene to visualize the world as you read about it. We show compelling results on several major tourist sites.

Overview Page:

SIGGRAPH Asia 2013: Technical Papers

“3D Wikipedia: using online text to automatically label and navigate reconstructed geometry” by Russell, Martin-Brualla, Butler, Seitz and Zettlemoyer
