VoCo: text-based insertion and replacement in audio narration
Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, and Adam Finkelstein





Session/Category Title: Speech and Facial Animation




Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration and perform select, cut, copy, and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly into the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
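The text-based editing model described above rests on a simple idea: each transcript word carries a time range (e.g., from forced alignment), so cut and insert operations in the text map directly to splices in the waveform. The sketch below illustrates that mapping only; the `Word`, `delete_word`, and `insert_word` names are hypothetical and are not the paper's actual API, and the synthesized samples for an inserted word are assumed to come from an external text-to-speech plus voice-conversion step.

```python
# Minimal sketch of transcript-driven waveform editing, assuming each word
# has start/end times in seconds from a forced aligner. Names are illustrative.
from dataclasses import dataclass

SAMPLE_RATE = 16000  # assumed sample rate of the narration


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


def to_samples(t: float) -> int:
    """Convert a time in seconds to a sample index."""
    return int(round(t * SAMPLE_RATE))


def delete_word(samples, words, index):
    """Cut one word's samples and shift the timestamps of later words."""
    w = words[index]
    dur = w.end - w.start
    new_samples = samples[:to_samples(w.start)] + samples[to_samples(w.end):]
    new_words = words[:index] + [
        Word(x.text, x.start - dur, x.end - dur) for x in words[index + 1:]
    ]
    return new_samples, new_words


def insert_word(samples, words, index, text, word_samples):
    """Splice synthesized samples for a new word in before words[index]."""
    at_end = index >= len(words)
    at = len(samples) if at_end else to_samples(words[index].start)
    start = len(samples) / SAMPLE_RATE if at_end else words[index].start
    dur = len(word_samples) / SAMPLE_RATE
    spliced = samples[:at] + list(word_samples) + samples[at:]
    inserted = Word(text, start, start + dur)
    shifted = [Word(x.text, x.start + dur, x.end + dur) for x in words[index:]]
    return spliced, words[:index] + [inserted] + shifted
```

In this toy model the audio is an in-memory list of samples; a real editor would also crossfade at splice points and match room tone at the boundaries (which is part of what makes the actual synthesis problem hard).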


    1. Acapela Group. 2016. http://www.acapela-group.com. (2016). Accessed: 2016-04-10.Google Scholar
    2. Ryo Aihara, Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki. 2014. Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014). Google ScholarCross Ref
    3. Floraine Berthouzoz, Wilmot Li, and Maneesh Agrawala. 2012. Tools for placing cuts and transitions in interview video. ACM Trans. on Graphics (TOG) 31, 4 (2012), 67.Google ScholarDigital Library
    4. Paulus Petrus Gerardus Boersma et al. 2002. Praat, a system for doing phonetics by computer. Glot international 5 (2002).Google Scholar
    5. Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video Rewrite: Driving Visual Speech with Audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’97). 353–360. Google ScholarDigital Library
    6. Juan Casares, A Chris Long, Brad A Myers, Rishi Bhatnagar, Scott M Stevens, Laura Dabbish, Dan Yocum, and Albert Corbett. 2002. Simplifying video editing using metadata. In Proceedings of the 4th conference on Designing interactive systems: processes, practices, methods, and techniques. ACM, 157–166.Google ScholarDigital Library
    7. Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai. 2014. Voice conversion using deep neural networks with layer-wise generative training. Audio, Speech, and Language Processing, IEEE/ACM Transactions on 22, 12 (2014), 1859–1872.Google ScholarDigital Library
    8. Alistair D Conkie and Stephen Isard. 1997. Optimal coupling of diphones. In Progress in speech synthesis. Springer, 293–304.Google Scholar
    9. Srinivas Desai, E Veera Raghavendra, B Yegnanarayana, Alan W Black, and Kishore Prahallad. 2009. Voice conversion using Artificial Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009). Google ScholarDigital Library
    10. Thierry Dutoit, Andre Holzapfel, Matthieu Jottrand, Alexis Moinet, J Prez, and Yannis Stylianou. 2007. Towards a Voice Conversion System Based on Frame Selection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007).Google ScholarCross Ref
    11. G David Forney. 1973. The Viterbi algorithm. Proc. IEEE 61, 3 (1973), 268–278. Google ScholarCross Ref
    12. Kei Fujii, Jun Okawa, and Kaori Suigetsu. 2007. High-Individuality Voice Conversion Based on Concatenative Speech Synthesis. International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering 1, 11 (2007), 1617 — 1622.Google Scholar
    13. François G. Germain, Gautham J. Mysore, and Takako Fujioka. 2016. Equalization Matching of Speech Recordings in Real-World Environments. In 41st IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP 2016). Google ScholarCross Ref
    14. Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. 2001. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM, 327–340. Google ScholarDigital Library
    15. Andrew J Hunt and Alan W Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1996). 373–376.Google ScholarDigital Library
    16. Zeyu Jin, Adam Finkelstein, Stephen DiVerdi, Jingwan Lu, and Gautham J. Mysore. 2016. CUTE: a concatenative method for voice conversion using exemplar-based unit selection. In 41st IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP 2016). Google ScholarCross Ref
    17. Alexander Kain and Michael W Macon. 1998. Spectral voice conversion for text-to-speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1998). 285–288. Google ScholarCross Ref
    18. Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. 2008. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008). 3933–3936. Google ScholarCross Ref
    19. John Kominek and Alan W Black. 2004. The CMU Arctic speech databases. In Fifth ISCA Workshop on Speech Synthesis.Google Scholar
    20. Robert F. Kubichek. 1993. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing. 125–128. Google ScholarCross Ref
    21. Sergey Levine, Christian Theobalt, and Vladlen Koltun. 2009. Real-time Prosody-driven Synthesis of Body Language. ACM Trans. Graph. 28, 5, Article 172 (Dec. 2009), 10 pages. Google ScholarDigital Library
    22. Jingwan Lu, Fisher Yu, Adam Finkelstein, and Stephen DiVerdi. 2012. HelpingHand: Example-based Stroke Stylization. ACM Trans. Graph. 31, 4, Article 46 (July 2012), 10 pages.Google Scholar
    23. Michal Lukáč, Jakub Fišer, Jean-Charles Bazin, Ondřej Jamriška, Alexander Sorkine-Hornung, and Daniel Sýkora. 2013. Painting by Feature: Texture Boundaries for Example-based Image Creation. ACM Trans. Graph. 32, 4, Article 116 (July 2013), 8 pages.Google ScholarDigital Library
    24. Anderson F Machado and Marcelo Queiroz. 2010. Voice conversion: A critical survey. Proc. Sound and Music Computing (SMC) (2010), 1–8.Google Scholar
    25. Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. 2010. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083 (2010).Google Scholar
    26. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).Google Scholar
    27. Amy Pavel, Dan B. Goldman, Björn Hartmann, and Maneesh Agrawala. 2015. SceneSkim: Searching and Browsing Movies Using Synchronized Captions, Scripts and Plot Summaries. In Proceedings of the 28th annual ACM symposium on User interface software and technology (UIST 2015). 181–190. Google ScholarDigital Library
    28. Amy Pavel, Björn Hartmann, and Maneesh Agrawala. 2014. Video digests: A browsable, skimmable format for informational lecture videos. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST 2014). 573–582. Google ScholarDigital Library
    29. Bhiksha Raj, Tuomas Virtanen, Sourish Chaudhuri, and Rita Singh. 2010. Non-negative matrix factorization based compensation of music for automatic speech recognition. In Interspeech 2010. 717–720.Google Scholar
    30. Marc Roelands and Werner Verhelst. 1993. Waveform similarity based overlap-add (WSOLA) for time-scale modification of speech: structures and evaluation. In EUROSPEECH 1993. 337–340.Google Scholar
    31. Steve Rubin, Floraine Berthouzoz, Gautham J. Mysore, Wilmot Li, and Maneesh Agrawala. 2013. Content-based Tools for Editing Audio Stories. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (UIST 2013). 113–122. Google ScholarDigital Library
    32. Kåre Sjölander. 2003. An HMM-based system for automatic segmentation and alignment of speech. In Proceedings of Fonetik 2003. 93–96.Google Scholar
    33. Matthew Stone, Doug DeCarlo, Insuk Oh, Christian Rodriguez, Adrian Stere, Alyssa Lees, and Chris Bregler. 2004. Speaking with Hands: Creating Animated Conversational Characters from Recordings of Human Performance. ACM Trans. Graph. 23, 3 (Aug. 2004), 506–513. Google ScholarDigital Library
    34. Yannis Stylianou, Olivier Cappé, and Eric Moulines. 1998. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing 6, 2 (1998), 131–142. Google ScholarCross Ref
    35. Paul Taylor. 2009. Text-to-Speech Synthesis. Cambridge University Press. Google ScholarCross Ref
    36. Tomoki Toda, Alan W Black, and Keiichi Tokuda. 2007a. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory. IEEE Transactions on Audio, Speech, and Language Processing 15, 8 (2007), 2222–2235. Google ScholarDigital Library
    37. Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano. 2007b. One-to-many and many-to-one voice conversion based on eigenvoices. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007). IV-1249. Google ScholarCross Ref
    38. Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano. 2001. Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2001). 841–844. Google ScholarDigital Library
    39. Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. 2013. Speech Synthesis Based on Hidden Markov Models. Proc. IEEE 101, 5 (May 2013), 1234–1252. Google ScholarCross Ref
    40. Steve Whittaker and Brian Amento. 2004. Semantic Speech Editing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2004). 527–534. Google ScholarDigital Library
    41. Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Engsiong Chng, and Haizhou Li. 2013. Exemplar-based unit selection for voice conversion utilizing temporal information. In INTERSPEECH 2013. 3057–3061.Google Scholar
