“Binaural audio generation via multi-task learning” by Li, Liu and Manocha
Session/Category Title: Audio and Visual Displays
Abstract:
We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages additional information from two related tasks: the binaural audio generation task and the flipped audio classification task. Our learning model extracts spatialization features from the visual and audio input, predicts the left and right audio channels, and judges whether the left and right channels are flipped. First, we extract visual features using ResNet from the video frames. Next, we perform binaural audio generation and flipped audio classification using separate subnetworks based on visual features. Our learning method optimizes the overall loss based on the weighted sum of the losses of the two tasks. We train and evaluate our model on the FAIR-Play dataset and the YouTube-ASMR dataset. We perform quantitative and qualitative evaluations to demonstrate the benefits of our approach over prior techniques.
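The abstract describes a shared audio-visual feature extractor feeding two task-specific subnetworks, trained on a weighted sum of the two task losses. The following PyTorch sketch is illustrative only and is not the authors' implementation: the ResNet-18 backbone, the linear task heads, the feature dimensions, and the loss weights w_gen and w_cls are assumptions made for exposition.

```python
# Illustrative sketch (not the authors' code): a shared visual encoder with
# two task heads, optimized with a weighted sum of the two task losses.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class MultiTaskBinauralNet(nn.Module):
    def __init__(self, audio_dim=512, fused_dim=512):
        super().__init__()
        # Visual features from a ResNet backbone (final classifier removed).
        backbone = models.resnet18()
        self.visual_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Hypothetical mono-audio feature encoder (e.g. pooled spectrogram features).
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, fused_dim), nn.ReLU())
        # Task 1: binaural generation head predicting left/right channel features.
        self.binaural_head = nn.Linear(fused_dim + 512, 2 * audio_dim)
        # Task 2: flipped-audio classification head (are left/right swapped?).
        self.flip_head = nn.Linear(fused_dim + 512, 1)

    def forward(self, mono_feat, frame):
        v = self.visual_encoder(frame).flatten(1)   # (B, 512) visual features
        a = self.audio_encoder(mono_feat)           # (B, fused_dim) audio features
        joint = torch.cat([a, v], dim=1)            # shared audio-visual features
        pred_lr = self.binaural_head(joint)         # predicted left/right features
        flip_logit = self.flip_head(joint)          # flipped-or-not logit
        return pred_lr, flip_logit


def multitask_loss(pred_lr, target_lr, flip_logit, flip_label, w_gen=1.0, w_cls=0.1):
    """Weighted sum of the generation loss and the flip-classification loss."""
    gen_loss = F.mse_loss(pred_lr, target_lr)
    cls_loss = F.binary_cross_entropy_with_logits(flip_logit, flip_label)
    return w_gen * gen_loss + w_cls * cls_loss


# Minimal usage with dummy tensors (batch of 4).
model = MultiTaskBinauralNet()
mono = torch.randn(4, 512)                # placeholder mono-audio features
frames = torch.randn(4, 3, 224, 224)      # placeholder video frames
pred_lr, flip_logit = model(mono, frames)
loss = multitask_loss(pred_lr, torch.randn_like(pred_lr),
                      flip_logit, torch.randint(0, 2, (4, 1)).float())
loss.backward()
```

In practice the two subnetworks would operate on time-frequency representations of the mono input rather than the flat feature vectors used here; the sketch only shows how the shared features and the weighted two-task loss fit together.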
References:
1. Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018b. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), 1–11.
2. Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018a. The conversation: deep audio-visual speech enhancement. In Proceedings of Interspeech. 3244–3248.
3. Andrew Allen and Nikunj Raghuvanshi. 2015. Aerophones in flatland: Interactive wave simulation of wind instruments. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–11.
4. Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: learning sound representations from unlabeled video. In Proceedings of the Advances in Neural Information Processing Systems. 892–900.
5. Chunxiao Cao, Zhong Ren, Carl Schissler, Dinesh Manocha, and Kun Zhou. 2016. Interactive sound propagation with bidirectional path tracing. ACM Transactions on Graphics (TOG) 35, 6 (2016), 1–11.
6. Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. 2017. Monoaural audio source separation using deep convolutional neural networks. In Proceedings of International Conference on Latent Variable Analysis and Signal Separation. 258–266.
7. Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. 2009. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons.
8. Li Deng, Geoffrey Hinton, and Brian Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: An overview. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8599–8603.
9. Chuang Gan, Deng Huang, Hang Zhao, Joshua B Tenenbaum, and Antonio Torralba. 2020. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10478–10487.
10. Ruohan Gao, Rogerio Feris, and Kristen Grauman. 2018. Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV). 35–53.
11. Ruohan Gao and Kristen Grauman. 2019a. 2.5D visual sound. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 324–333.
12. Ruohan Gao and Kristen Grauman. 2019b. Co-separating sounds of visual objects. In Proceedings of the IEEE International Conference on Computer Vision. 3879–3888.
13. Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
14. Wangli Hao, Zhaoxiang Zhang, and He Guan. 2018. CMCGAN: a uniform framework for cross-modal visual-audio mutual generation. In Proceedings of the AAAI Conference on Artificial Intelligence. 6886–6893.
15. Simon Haykin and Zhe Chen. 2005. The cocktail party problem. Neural Computation 17, 9 (2005), 1875–1902.
16. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
17. Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, and Ian Sturdy. 2017. Putting a face to the voice: fusing audio and visual signals across a video to determine speakers. arXiv preprint arXiv:1706.00079 (2017).
18. Di Hu, Feiping Nie, and Xuelong Li. 2018. Deep co-clustering for unsupervised audiovisual learning. arXiv preprint arXiv:1807.03094 (2018).
19. Hansung Kim, Luca Remaggi, Philip JB Jackson, and Adrian Hilton. 2019. Immersive spatial audio reproduction for VR/AR using room acoustic modelling from 360° images. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces (VR). 120–126.
20. Diederik P Kingma and Jimmy Ba. 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
21. Asbjørn Krokstad, Staffan Strøm, and Svein Sørsdal. 1968. Calculating the acoustical room response by the use of a ray tracing technique. Journal of Sound and Vibration 8, 1 (1968), 118–125.
22. Thomas Le Cornu and Ben Milner. 2017. Generating intelligible audio speech from visual speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25, 9 (2017), 1751–1761.
23. Dingzeyu Li, Timothy R Langlois, and Changxi Zheng. 2018. Scene-aware audio for 360° videos. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–12.
24. Shikun Liu, Edward Johns, and Andrew J. Davison. 2019. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
25. Shiguang Liu and Dinesh Manocha. 2020. Sound synthesis, propagation, and rendering: a survey. arXiv preprint arXiv:2011.05538 (2020).
26. Francesc Lluís, Vasileios Chatziioannou, and Alex Hofmann. 2021. Points2Sound: From mono to binaural audio using 3D point cloud scenes. arXiv preprint arXiv:2104.12462 (2021).
27. Yu-Ding Lu, Hsin-Ying Lee, Hung-Yu Tseng, and Ming-Hsuan Yang. 2019. Self-supervised audio spatialization with correspondence classifier. In Proceedings of the IEEE International Conference on Image Processing (ICIP). 3347–3351.
28. Josh H McDermott. 2009. The cocktail party problem. Current Biology 19, 22 (2009), R1024–R1027.
29. Ravish Mehra, Nikunj Raghuvanshi, Lakulish Antani, Anish Chandak, Sean Curtis, and Dinesh Manocha. 2013. Wave-based sound propagation in large open scenes using an equivalent source formulation. ACM Transactions on Graphics (TOG) 32, 2 (2013), 1–13.
30. Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, and Oliver Wang. 2018. Self-supervised generation of spatial audio for 360° video. In Proceedings of the Advances in Neural Information Processing Systems. 362–372.
31. Giovanni Morrone, Sonia Bergamaschi, Luca Pasa, Luciano Fadiga, Vadim Tikhanoff, and Leonardo Badino. 2019. Face landmark-based speaker-independent audiovisual speech enhancement in multi-talker environments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6900–6904.
32. Arsha Nagrani, Samuel Albanie, and Andrew Zisserman. 2018. Learnable PINs: cross-modal embeddings for person identity. In Proceedings of the European Conference on Computer Vision (ECCV). 71–88.
33. Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. 2016. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2405–2413.
34. Sanjeel Parekh, Alexey Ozerov, Slim Essid, Ngoc QK Duong, Patrick Pérez, and Gaël Richard. 2019. Identify, locate and separate: audio-visual object extraction in large video collections using weak supervision. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). 268–272.
35. Lord Rayleigh. 1875. On our perception of the direction of a source of sound. Proceedings of the Musical Association 2 (1875), 75–84.
36. Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017).
37. Atul Rungta, Carl Schissler, Nicholas Rewkowski, Ravish Mehra, and Dinesh Manocha. 2018. Diffraction kernels for interactive sound propagation in dynamic environments. IEEE Transactions on Visualization and Computer Graphics 24, 4 (2018), 1613–1622.
38. Carl Schissler, Aaron Nicholls, and Ravish Mehra. 2016. Efficient HRTF-based spatial audio for area and volumetric sources. IEEE Transactions on Visualization and Computer Graphics 22, 4 (2016), 1356–1366.
39. Carl Schissler, Peter Stirling, and Ravish Mehra. 2017. Efficient construction of the spatial room impulse response. In IEEE Virtual Reality (VR). IEEE, 122–130.
40. Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. 2018. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4358–4366.
41. Zhenyu Tang, Nicholas J Bryan, Dingzeyu Li, Timothy R Langlois, and Dinesh Manocha. 2020. Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics 26, 5 (2020), 1991–2001.
42. Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. 2018. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV). 247–263.
43. Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. 2021. Multi-task learning for dense prediction tasks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
44. Tuomas Virtanen. 2007. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing 15, 3 (2007), 1066–1074.
45. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
46. DeLiang Wang and Jitong Chen. 2018. Supervised speech separation based on deep learning: an overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, 10 (2018), 1702–1726.
47. Frederic L Wightman and Doris J Kistler. 1992. The dominant role of low-frequency interaural time differences in sound localization. The Journal of the Acoustical Society of America 91, 3 (1992), 1648–1661.
48. Olivia Wiles, A Sophia Koepke, and Andrew Zisserman. 2018. X2face: a network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV). 670–686.
49. Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, and Dahua Lin. 2021. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15485–15494.
50. Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, and Tetsuya Ogata. 2019. Weakly-supervised deep recurrent neural networks for basic dance step generation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). 1–8.
51. Karren Yang, Bryan Russell, and Justin Salamon. 2020. Telling left from right: learning spatial correspondence of sight and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9932–9941.
52. Hengchin Yeh, Ravish Mehra, Zhimin Ren, Lakulish Antani, Dinesh Manocha, and Ming Lin. 2013. Wave-ray coupling for interactive sound propagation in large complex scenes. ACM Transactions on Graphics (TOG) 32, 6 (2013), 1–11.
53. William A Yost. 1997. The cocktail party problem: forty years later. Binaural and Spatial Hearing in Real and Virtual Environments (1997), 329–347.
54. Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 241–245.
55. Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba. 2019. The sound of motions. In Proceedings of the IEEE International Conference on Computer Vision. 1735–1744.
56. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV). 570–586.
57. Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, and Ziwei Liu. 2020. Sep-stereo: visually guided stereophonic audio generation by associating source separation. In Proceedings of the European Conference on Computer Vision. 52–69.
58. Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L Berg. 2018. Visual to sound: generating natural sound for videos in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3550–3558.
59. Hao Zhu, Mandi Luo, Rui Wang, Aihua Zheng, and Ran He. 2020. Deep audio-visual learning: a survey. arXiv preprint arXiv:2001.04758 (2020).


