“VideoSnapping: interactive synchronization of multiple videos” by Wang, Schroers, Zimmer, Gross and Sorkine-Hornung

Authors: Oliver Wang, Christopher Schroers, Henning Zimmer, Markus Gross, and Alexander Sorkine-Hornung



Session Title:

    Video Applications






    Aligning video is a fundamental task in computer graphics and vision, required for a wide range of applications. We present an interactive method for computing optimal nonlinear temporal alignments of an arbitrary number of videos. We first derive a robust approximation of alignment quality between pairs of clips, computed as a weighted histogram of feature matches. We then find optimal temporal mappings (constituting frame correspondences) using a graph-based approach that allows very efficient evaluation under artist constraints. This enables an enhancement to the “snapping” interface in video editing tools: videos in a timeline can now snap to one another based on their content when dragged by an artist, rather than simply on their start and end times. The pairwise snapping is then generalized to multiple clips, yielding a globally optimal temporal synchronization that automatically arranges a series of clips filmed at different times into a single consistent time frame. Followed by a simple spatial registration, this produces high-quality spatiotemporal video alignments at a fraction of the computational cost of previous methods. Assisted temporal alignment is a degree of freedom that has been largely unexplored, yet it is an important task in video editing. Our approach is simple to implement, highly efficient, and very robust to differences in video content, allowing interactive exploration of the temporal alignment space for multiple real-world HD videos.
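    The pairwise alignment described above can be illustrated with a minimal sketch: build a frame-to-frame cost matrix between two clips, then recover a monotonic temporal mapping by dynamic programming over the implied graph. Note the assumptions here: the paper derives its cost from a weighted histogram of feature matches and solves a shortest-path problem with artist constraints; this stand-in uses plain descriptor distances and an unconstrained DTW-style recurrence, purely for illustration.

    ```python
    import numpy as np

    def alignment_cost(features_a, features_b):
        """Toy per-frame-pair cost: Euclidean distance between frame
        descriptors. (A stand-in for the paper's weighted histogram
        of feature matches.) Inputs are (num_frames, dim) arrays."""
        diff = features_a[:, None, :] - features_b[None, :, :]
        return np.linalg.norm(diff, axis=2)  # shape (len_a, len_b)

    def optimal_temporal_path(cost):
        """Monotonic frame mapping minimizing accumulated cost via
        dynamic programming over a DTW-style graph (the paper instead
        solves an equivalent shortest-path problem efficiently)."""
        n, m = cost.shape
        acc = np.full((n, m), np.inf)
        acc[0, 0] = cost[0, 0]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    continue
                prev = min(
                    acc[i - 1, j] if i > 0 else np.inf,  # hold clip B
                    acc[i, j - 1] if j > 0 else np.inf,  # hold clip A
                    acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
                acc[i, j] = cost[i, j] + prev
        # Backtrack from the last frame pair to recover the mapping.
        i, j = n - 1, m - 1
        path = [(i, j)]
        while (i, j) != (0, 0):
            candidates = []
            if i > 0:
                candidates.append((acc[i - 1, j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i, j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
            _, (i, j) = min(candidates, key=lambda c: c[0])
            path.append((i, j))
        return path[::-1]  # list of (frame_a, frame_b) correspondences
    ```

    On two clips with identical frame descriptors, the recovered path is the diagonal mapping, i.e. frame i of one clip snaps to frame i of the other; with shifted or retimed content the path bends accordingly, which is exactly the nonlinear temporal mapping the timeline "snapping" interaction exposes.
    
    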


