Project starline: a high-fidelity telepresence system

We present a real-time bidirectional communication system that lets two people, separated by distance, experience a face-to-face conversation as if they were copresent. It is the first telepresence system that is demonstrably better than 2D videoconferencing, as measured using participant ratings (e.g., presence, attentiveness, reaction-gauging, engagement), meeting recall, and observed nonverbal behaviors (e.g., head nods, eyebrow movements). This milestone is reached by maximizing audiovisual fidelity and the sense of copresence in all design elements, including physical layout, lighting, face tracking, multi-view capture, microphone array, multi-stream compression, loudspeaker output, and lenticular display. Our system achieves key 3D audiovisual cues (stereopsis, motion parallax, and spatialized audio) and enables the full range of communication cues (eye contact, hand gestures, and body language), yet does not require special glasses or body-worn microphones/headphones. The system consists of a head-tracked autostereoscopic display, high-resolution 3D capture and rendering subsystems, and network transmission using compressed color and depth video streams. Other contributions include a novel image-based geometry fusion algorithm, free-space dereverberation, and talker localization.

SIGGRAPH Asia 2021: Technical Papers

“Project starline: a high-fidelity telepresence system” by Lawrence, Goldman, Achar, Blascovich, Desloge, et al. …
