Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to “focus” the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).

References:

1. T. Afouras, J. S. Chung, and A. Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. In arXiv:1804.04121.Google Scholar
2. Anna Llagostera Casanovas, Gianluca Monaci, Pierre Vandergheynst, and Rémi Gribonval. 2010. Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia 12, 5 (2010), 358–371. Google ScholarDigital Library
3. E Colin Cherry. 1953. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America 25, 5 (1953), 975–979.Google ScholarCross Ref
4. Joon Son Chung, Andrew W. Senior, Oriol Vinyals, and Andrew Zisserman. 2016. Lip Reading Sentences in the Wild. CoRR abs/1611.05358 (2016).Google Scholar
5. Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Freeman. 2016. Synthesizing normalized faces from facial identity features. In CVPR’17.Google Scholar
6. Pierre Comon and Christian Jutten. 2010. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press. Google ScholarDigital Library
7. Masood Delfarah and DeLiang Wang. 2017. Features for Masking-Based Monaural Speech Separation in Reverberant Conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017), 1085–1094. Google ScholarDigital Library
8. Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2017. Improved Speech Reconstruction from Silent Video. In ICCV 2017 Workshop on Computer Vision for Audio-Visual Media.Google ScholarCross Ref
9. Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux. 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015).Google ScholarCross Ref
10. Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, and Zhigang Luo. 2017. Audio visual speech recognition with multimodal recurrent neural networks. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 681–688.Google ScholarCross Ref
11. Aviv Gabbay, Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2018. Seeing Through Noise: Speaker Separation and Enhancement using Visually-derived Speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018).Google Scholar
12. Aviv Gabbay, Asaph Shamir, and Shmuel Peleg. 2017. Visual Speech Enhancement using Noise-Invariant Training. arXiv preprint arXiv:1711.08789 (2017).Google Scholar
13. R Gao, R Feris, and K. Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Video. arXiv preprint arXiv:1804.01665 (2018).Google Scholar
14. Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017.Google ScholarDigital Library
15. Elana Zion Golumbic, Gregory B Cogan, Charles E. Schroeder, and David Poeppel. 2013. Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”. The Journal of neuroscience : the official journal of the Society for Neuroscience 33 4 (2013), 1417–26.Google ScholarCross Ref
16. Naomi Harte and Eoin Gillen. 2015. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia 17, 5 (2015), 603–615.Google ScholarDigital Library
17. David F. Harwath, Antonio Torralba, and James R. Glass. 2016. Unsupervised Learning of Spoken Language with Visual Context. In NIPS. Google ScholarDigital Library
18. John Hershey, Hagai Attias, Nebojsa Jojic, and Trausti Kristjansson. 2004. Audio-visual graphical models for speech processing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).Google ScholarCross Ref
19. John R Hershey and Michael Casey. 2002. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems. 1173–1180. Google ScholarDigital Library
20. John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. 2016. Deep clustering: Discriminative embeddings for segmentation and separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), 31–35.Google ScholarDigital Library
21. Andrew Hines, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil C. Kokaram, and Naomi Harte. 2015. ViSQOLAudio: An objective audio quality metric for low bitrate codecs. The Journal of the Acoustical Society of America 137 6 (2015), EL449–55.Google Scholar
22. Andrew Hines and Naomi Harte. 2012. Speech Intelligibility Prediction Using a Neurogram Similarity Index Measure. Speech Commun. 54, 2 (Feb. 2012), 306–320. Google ScholarDigital Library
23. Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, and Ian Sturdy. 2017. Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers. CoRR abs/1706.00079 (2017).Google Scholar
24. Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Jen-Chun Lin, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang. 2018. Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 2 (2018), 117–128.Google ScholarCross Ref
25. Yongtao Hu, Jimmy SJ Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang. 2015. Deep multimodal speaker naming. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 1107–1110. Google ScholarDigital Library
26. Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML. Google ScholarDigital Library
27. Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey. 2016. Single-Channel Multi-Speaker Separation Using Deep Clustering. Interspeech (2016), 545–549.Google Scholar
28. Faheem Khan. 2016. Audio-visual speaker separation. Ph.D. Dissertation. University of East Anglia.Google Scholar
29. Wei Ji Ma, Xiang Zhou, Lars A. Ross, John J. Foxe, and Lucas C. Parra. 2009. Lip-Reading Aids Word Recognition Most in Moderate Noise: A Bayesian Explanation Using High-Dimensional Feature Space. PLoS ONE 4 (2009), 233 — 252.Google ScholarCross Ref
30. Josh H McDermott. 2009. The cocktail party problem. Current Biology 19, 22 (2009), R1024–R1027.Google ScholarCross Ref
31. Gianluca Monaci. 2011. Towards real-time audiovisual speaker localization. In Signal Processing Conference, 2011 19th European. IEEE, 1055–1059.Google Scholar
32. Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. 2015. Deep multimodal learning for audio-visual speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2130–2134.Google ScholarCross Ref
33. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In ICML. Google ScholarDigital Library
34. Andrew Owens and Alexei A Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. (2018).Google Scholar
35. Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci, and John N. Gowdy. 2002. Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus. EURASIP J. Adv. Sig. Proc. 2002 (2002), 1189–1201. Google ScholarDigital Library
36. Jie Pu, Yannis Panagakis, Stavros Petridis, and Maja Pantic. 2017. Audio-visual object localization and separation using low-rank and sparsity. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2901–2905.Google ScholarCross Ref
37. Bertrand Rivet, Wenwu Wang, Syed M. Naqvi, and Jonathon A. Chambers. 2014. Audio-visual Speech Source Separation: An overview of key methodologies. IEEE Signal Processing Magazine 31 (2014), 125–134.Google ScholarCross Ref
38. Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. 2001. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, Vol. 2. IEEE, 749–752. Google ScholarDigital Library
39. Ethan M Rudd, Manuel Günther, and Terrance E Boult. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision. Springer, 19–35.Google ScholarCross Ref
40. J S Garofolo, Lori Lamel, W M Fisher, Jonathan Fiscus, D S. Pallett, N L. Dahlgren, and V Zue. 1992. TIMIT Acoustic-phonetic Continuous Speech Corpus. (11 1992).Google Scholar
41. Lei Sun, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.Google Scholar
42. Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. 2010. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 4214–4217.Google ScholarCross Ref
43. Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. 2013. The second `chime’ speech separation and recognition challenge: Datasets, tasks and baselines. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), 126–130.Google ScholarCross Ref
44. E. Vincent, R. Gribonval, and C. Fevotte. 2006. Performance Measurement in Blind Audio Source Separation. Trans. Audio, Speech and Lang. Proc. 14, 4 (2006), 1462–1469. Google ScholarDigital Library
45. DeLiang Wang and Jitong Chen. 2017. Supervised Speech Separation Based on Deep Learning: An Overview. CoRR abs/1708.07524 (2017).Google Scholar
46. Yuxuan Wang, Arun Narayanan, and DeLiang Wang. 2014. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22, 12 (2014), 1849–1858. Google ScholarDigital Library
47. Ziteng Wang, Xiaofei Wang, Xu Li, Qiang Fu, and Yonghong Yan. 2016. Oracle performance investigation of the ideal masks. In IWAENC.Google Scholar
48. Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn W. Schuller. 2015. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In LVA/ICA. Google ScholarDigital Library
49. Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017), 241–245.Google ScholarCross Ref
50. Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 818–833.Google ScholarCross Ref
51. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The Sound of Pixels. (2018).Google Scholar
52. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856 (2014).Google Scholar

ACM Digital Library Publication:

Overview Page:

SIGGRAPH 2018: Technical Papers

“Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation” by Ephrat, Mosseri, Lang, Dekel, Wilson, et al. …

Conference:

Type(s):

Entry Number: 112

Title:

Session/Category Title: Sounds Good!

Presenter(s)/Author(s):

Moderator(s):

Abstract:

References:

ACM Digital Library Publication:

Overview Page:

Sponsored by: