“Painting-to-3D Model Alignment via Discriminative Visual Elements” by Aubry, Russell and Sivic

  • ©Mathieu Aubry, Bryan Russell, and Josef Sivic

Title:

    Painting-to-3D Model Alignment via Discriminative Visual Elements

Session/Category Title:   Depth for All Occasions


Abstract:


    This article describes a technique that can reliably align arbitrary 2D depictions of an architectural site, including drawings, paintings, and historical photographs, with a 3D model of the site. This is a tremendously difficult task, as the appearance and scene structure in the 2D depictions can be very different from the appearance and geometry of the 3D model, for example, due to the specific rendering style, drawing error, age, lighting, or change of seasons. In addition, we face a hard search problem: the number of possible alignments of the painting to a large 3D model, such as a partial reconstruction of a city, is huge. To address these issues, we develop a new compact representation of complex 3D scenes. The 3D model of the scene is represented by a small set of discriminative visual elements that are automatically learned from rendered views. Similar to object detection, the set of visual elements, as well as the weights of individual features for each element, are learned in a discriminative fashion. We show that the learned visual elements are reliably matched in 2D depictions of the scene despite large variations in rendering style (e.g., watercolor, sketch, historical photograph) and structural changes (e.g., missing scene parts, large occluders) of the scene. We demonstrate an application of the proposed approach to automatic rephotography to find an approximate viewpoint of historical paintings and photographs with respect to a 3D model of the site. The proposed alignment procedure is validated via a human user study on a new database of paintings and sketches spanning several sites. The results demonstrate that our algorithm produces significantly better alignments than several baseline methods.
