“Computational video editing for dialogue-driven scenes” by Leake, Davis, Truong and Agrawala

  • © Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala







    We present a system for efficiently editing video of dialogue-driven scenes. The input to our system is a standard film script and multiple video takes, each capturing a different camera framing or performance of the complete scene. Our system then automatically selects the most appropriate clip from one of the input takes, for each line of dialogue, based on a user-specified set of film-editing idioms. Our system starts by segmenting the input script into lines of dialogue and then splitting each input take into a sequence of clips time-aligned with each line. Next, it labels the script and the clips with high-level structural information (e.g., emotional sentiment of dialogue, camera framing of clip, etc.). After this pre-process, our interface offers a set of basic idioms that users can combine in a variety of ways to build custom editing styles. Our system encodes each basic idiom as a Hidden Markov Model that relates editing decisions to the labels extracted in the pre-process. For short scenes (< 2 minutes, 8–16 takes, 6–27 lines of dialogue) applying the user-specified combination of idioms to the pre-processed inputs generates an edited sequence in 2–3 seconds. We show that this is significantly faster than the hours of user time skilled editors typically require to produce such edits and that the quick feedback lets users iteratively explore the space of edit designs.
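The core selection step described above, choosing one clip per line of dialogue so that the sequence best satisfies an editing idiom, reduces to decoding a Hidden Markov Model with a standard Viterbi pass: per-line label costs play the role of emissions, and cut costs between takes play the role of transitions. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; `viterbi_edit`, the toy "speaker close-up" idiom, and all costs are assumptions for demonstration.

```python
def viterbi_edit(lines, takes, emit_cost, trans_cost):
    """Pick one take per dialogue line, minimizing total cost.

    lines      : list of per-line labels (e.g. speaker, sentiment)
    takes      : list of take labels (e.g. camera framing)
    emit_cost  : cost of showing take t during line l
    trans_cost : cost of cutting from take a to take b
    """
    n, k = len(lines), len(takes)
    cost = [[0.0] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for t in range(k):
        cost[0][t] = emit_cost(lines[0], takes[t])
    for i in range(1, n):
        for t in range(k):
            # Cheapest predecessor take, accounting for the cut cost.
            p = min(range(k),
                    key=lambda q: cost[i - 1][q] + trans_cost(takes[q], takes[t]))
            back[i][t] = p
            cost[i][t] = (cost[i - 1][p]
                          + trans_cost(takes[p], takes[t])
                          + emit_cost(lines[i], takes[t]))
    # Backtrack from the cheapest final state.
    t = min(range(k), key=lambda s: cost[-1][s])
    path = [t]
    for i in range(n - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    path.reverse()
    return [takes[t] for t in path]

# Toy idiom: show the current speaker's close-up; mildly penalize every cut.
lines = [{"speaker": "A"}, {"speaker": "B"}, {"speaker": "B"}, {"speaker": "A"}]
takes = ["close-up A", "close-up B", "wide"]
emit = lambda line, take: 0.0 if line["speaker"] in take else 1.0
trans = lambda a, b: 0.3 if a != b else 0.0

print(viterbi_edit(lines, takes, emit, trans))
# → ['close-up A', 'close-up B', 'close-up B', 'close-up A']
```

Because the number of takes and lines is small (8–16 takes, 6–27 lines in the paper's scenes), this dynamic program is essentially instantaneous, which is consistent with the 2–3 second turnaround the abstract reports.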


