Automatically Distributing Eulerian and Hybrid Fluid Simulations in the Cloud

Distributing a simulation across many machines can drastically speed up computations and increase detail. The computing cloud provides tremendous computing resources, but weak service guarantees force programs to manage significant system complexity: nodes, networks, and storage occasionally perform poorly or fail.

We describe Nimbus, a system that automatically distributes grid-based and hybrid simulations across cloud computing nodes. The main simulation loop is sequential code and launches distributed computations across many cores. The simulation on each core runs as if it is stand-alone: Nimbus automatically stitches these simulations into a single, larger one. To do this efficiently, Nimbus introduces a four-layer data model that translates between the contiguous, geometric objects used by simulation libraries and the replicated, fine-grain objects managed by its underlying cloud computing runtime.

Using PhysBAM particle-level set fluid simulations, we demonstrate that Nimbus can run higher detail simulations faster, distribute simulations on up to 512 cores, and run enormous simulations (1024³ cells). Nimbus automatically manages these distributed simulations, balancing load across nodes and recovering from failures. Implementations of PhysBAM water and smoke simulations as well as an open source heat-diffusion simulation show that Nimbus is general and can support complex simulations.

Nimbus can be downloaded from https://nimbus.stanford.edu.

References:

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). 265–283. http://dl.acm.org/citation.cfm?id=3026877.3026899
Academy of Motion Picture Arts and Sciences. 2017. Oscar Sci-Tech Awards. Retrieved April 3, 2018, from http://www.oscars.org/sci-tech.
Jérémie Allard and Bruno Raffin. 2005. A shader-based parallel rendering framework. In Proceedings of IEEE Visualization (VIS’05). IEEE, Los Alamitos, CA, 127–134.
Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI’13). 185–198. http://dl.acm.org/citation.cfm?id=2482626.2482645
Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI’10). 265–278. http://dl.acm.org/citation.cfm?id=1924943.1924962
Jason Ansel, Kapil Arya, and Gene Cooperman. 2009. DMTCP: Transparent checkpointing for cluster computations and the desktop. In Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing (IPDPS’09). IEEE, Los Alamitos, CA, 1–12.
David C. Arney and Joseph E. Flaherty. 1990. An adaptive mesh-moving and local refinement method for time-dependent partial differential equations. ACM Trans. Math. Softw. 16, 1, 48–71.
Michael Edward Bauer. 2014. Legion: Programming Distributed Heterogeneous Architectures With Logical Regions. Ph.D. Dissertation.
Gilbert Louis Bernstein, Chinmayee Shah, Crystal Lemire, Zachary Devito, Matthew Fisher, Philip Levis, and Pat Hanrahan. 2016. Ebb: A DSL for physical simulation on CPUs and GPUs. ACM Transactions on Graphics 35, 2, Article 21, 12 pages.
Umit V. Catalyurek, Erik G. Boman, Karen D. Devine, Doruk Bozdag, Robert Heaphy, and Lee Ann Riesen. 2007. Hypergraph-based dynamic load balancing for adaptive scientific computations. In Proceedings of the Parallel and Distributed Processing Symposium (IPDPS’07). IEEE, Los Alamitos, CA, 1–11.
Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph Von Praun, and Vivek Sarkar. 2005. X10: An object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Notices 40, 519–538.
NVIDIA Corporation. 2017. CUDA C Programming Guide. Retrieved April 3, 2018, from https://docs.nvidia.com/cuda/cuda-c-programming-guide/.
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS’12). 1223–1231.
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1, 107–113.
Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2004. Abstractions for dynamic data distribution. In Proceedings of the 9th International Workshops on High-Level Parallel Programming Models and Supportive Environments. IEEE, Los Alamitos, CA, 42–51.
Tyler Denniston, Shoaib Kamil, and Saman Amarasinghe. 2016. Distributed Halide. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’16). ACM, New York, NY, Article 5, 12 pages.
Mathieu Desbrun and Marie-Paule Gascuel. 1996. Smoothed particles: A new paradigm for animating highly deformable bodies. In Computer Animation and Simulation’96. Springer, 61–76.
Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley, Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken, Karthik Duraisamy, and others. 2011. Liszt: A domain specific language for building portable mesh-based PDE solvers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’11). ACM, New York, NY, 9.
Sheng Di, Yves Robert, Frédéric Vivien, Derrick Kondo, Cho-Li Wang, and Franck Cappello. 2013. Optimization of cloud task processing with checkpoint-restart mechanism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’13). IEEE, Los Alamitos, CA, 1–12.
J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. 2009. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage, and Analysis (SC’09). ACM, New York, NY, Article 53, 11 pages.
Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz. 2012. Efficient big data processing in Hadoop MapReduce. Proc. VLDB Endow. 5, 12, 2014–2015.
Pradeep Dubey, Pat Hanrahan, Ronald Fedkiw, Michael Lentine, and Craig Schroeder. 2011. PhysBAM: Physically based simulation. In Proceedings of the ACM SIGGRAPH 2011 Courses. ACM, New York, NY, 10.
R. Elliot English, Linhai Qiu, Yue Yu, and Ronald Fedkiw. 2013. Chimera grids for water simulation. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation. ACM, New York, NY, 85–94.
Douglas Enright, Ronald Fedkiw, Joel Ferziger, and Ian Mitchell. 2002a. A hybrid particle level set method for improved interface capturing. J. Comput. Phys. 183, 1, 83–116.
Douglas Enright, Stephen Marschner, and Ronald Fedkiw. 2002b. Animation and rendering of complex water surfaces. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH’02). ACM, New York, NY, 736–744.
Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, et al. 2006. Sequoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM, New York, NY, 83.
J. Davison de St Germain, John McCorquodale, Steven G. Parker, and Christopher R. Johnson. 2000. Uintah: A massively parallel problem solving environment. In Proceedings of the 9th International Symposium on High-Performance Distributed Computing. IEEE, Los Alamitos, CA, 33–41.
Sanjay Ghemawat and Jeff Dean. 2017. LevelDB. Retrieved April 3, 2018, from https://github.com/google/leveldb.
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). 17–30. http://dl.acm.org/citation.cfm?id=2387880.2387883
Nolan Goodnight. 2007. CUDA/OpenGL Fluid Simulation. NVIDIA Corporation.
Pat Hanrahan and Jim Lawson. 1990. A language for shading and lighting calculations. ACM SIGGRAPH Computer Graphics 24, 289–298.
Paul H. Hargrove and Jason C. Duell. 2006. Berkeley lab checkpoint/restart (BLCR)for Linux clusters. J. Phys. Conf. Ser. 46, 494.
Francis H. Harlow. 1962. The Particle-in-Cell Method for Numerical Solution of Problems in Fluid Dynamics. Technical Report. Los Alamos Scientific Laboratory, New Mexico.
Christopher J. Hughes, Radek Grzeszczuk, Eftychios Sifakis, Daehyun Kim, Sanjeev Kumar, Andrew P. Selle, Jatin Chhugani, Matthew Holliman, and Yen-Kuang Chen. 2007. Physical simulation for animation and visual effects: Parallelization and characterization for chip multiprocessors. ACM SIGARCH Computer Architecture News 35, 220–231.
Greg Humphreys, Mike Houston, Ren Ng, Randall Frank, Sean Ahern, Peter D. Kirchner, and James T. Klosowski. 2002. Chromium: A stream-processing framework for interactive rendering on clusters. ACM Trans. Graphics 21, 3, 693–702.
Emmanuel Jeannot, Esteban Meneses, Guillaume Mercier, François Tessier, and Gengbin Zheng. 2013. Communication and topology-aware load balancing in Charm++ with TreeMatch. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’13). 1–8.
Chenfanfu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. 2015. The affine particle-in-cell method. ACM Trans. Graphics 34, 4, 51.
Laxmikant V. Kale and Sanjeev Krishnan. 1993. CHARM++: A Portable Concurrent Object Oriented System Based on C++. Vol. 28. ACM, New York, NY.
Shoaib Kamil. 2017. StencilProbe: A Microbenchmark for Stencil Applications. Retrieved April 3, 2018, from http://people.csail.mit.edu/skamil/projects/stencilprobe/.
George Karypis and Vipin Kumar. 1996. Parallel multilevel K-way partitioning scheme for irregular graphs. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (SC’96). IEEE, Los Alamitos, CA, Article 35.
Fredrik Kjolstad, Shoaib Kamil, Jonathan Ragan-Kelley, David I. W. Levin, Shinjiro Sueda, Desai Chen, Etienne Vouga, et al. 2016. Simit: A language for physical simulation. ACM Trans. Graphics 35, 2, 20.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML’09). ACM, New York, NY, 609–616.
Jonathan Lifflander, Sriram Krishnamoorthy, and Laxmikant V. Kale. 2012. Work stealing and persistence-based load balancers for iterative overdecomposed applications. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC’12). ACM, New York, NY, 137–148.
Frank Losasso, Frédéric Gibou, and Ron Fedkiw. 2004. Simulating water and smoke with an octree data structure. ACM Trans. Graphics 23, 457–462.
David Luebke. 2008. CUDA: Scalable parallel programming for high-performance scientific computing. In Proceedings of the 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, Los Alamitos, CA, 836–838.
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 135–146.
Omid Mashayekhi, Hang Qu, Chnimayee Shah, and Philip Levis. 2017. Execution templates: Caching control plane decisions for strong scaling of data analytics. In Proceedings of the 2017 USENIX Annual Technical Conference (ATC’17). 513–526. https://www.usenix.org/conference/atc17/technical-sessions/presentation/mashayekhi.
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). ACM, New York, NY, 439–455.
Ken Museth, Jeff Lait, John Johanson, Jeff Budsberg, Ron Henderson, Mihai Alden, Peter Cucka, David Hill, and Andrew Pearce. 2013. OpenVDB: An open-source data structure and toolkit for high-resolution volumes. In Proceedings of the ACM SIGGRAPH 2013 Courses. ACM, New York, NY, 19.
Lionel M. Ni and Kai Hwang. 1985. Optimal load balancing in a multiple processor system with many job classes. IEEE Trans. Softw. Eng. 11, 5, 491–496.
Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun, and VMware ICSI. 2015. Making sense of performance in data analytics frameworks. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI’15). 293–307.
Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: Distributed, low latency scheduling. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). ACM, New York, NY, 69–84.
Saket Patkar, Mridul Aanjaneya, Dmitriy Karpman, and Ronald Fedkiw. 2013. A hybrid Lagrangian-Eulerian formulation for bubble generation and dynamics. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation. ACM, New York, NY, 105–114.
Joshua Peraza, Ananta Tiwari, Michael Laurenzano, Laura Carrington, William A. Ward, and Roy Campbell. 2013. Understanding the performance of stencil computations on Intel’s Xeon Phi. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’13). IEEE, Los Alamitos, CA, 1–5.
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6, 519–530.
Jon Reisch, Stephen Marshall, Magnus Wrenninge, Tolga Göktekin, Michael Hall, Michael O’Brien, Jason Johnston, Jordan Rempel, and Andy Lin. 2016. Simulating rivers in the good dinosaur. In Proceedings of the ACM SIGGRAPH 2016 Talks. ACM, New York, NY, 40.
Gabriel Rivera and Chau-Wen Tseng. 2000. Tiling optimizations for 3D scientific computations. In Proceedings of the ACM/IEEE 2000 Conference on Supercomputing. IEEE, Los Alamitos, CA, 32–32.
Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael Bauer, and Alex Aiken. 2015. Regent: A high-productivity programming language for HPC with logical regions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’15). ACM, New York, NY, 81.
Marc Snir. 1998. MPI—The Complete Reference: The MPI Core. Vol. 1. MIT Press, Cambridge, MA.
Jos Stam. 1999. Stable fluids. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. ACM, New York, NY, 121–128.
Matt Stanton, Ben Humberston, Brandon Kase, James F. O’Brien, Kayvon Fatahalian, and Adrien Treuille. 2014. Self-refining games using player analytics. ACM Trans. Graphics 33, 4, 73.
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir stencil compiler. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, New York, NY, 117–128.
The Khronos Group. 2017a. OpenCL Overview. Retrieved April 3, 2018, from https://www.khronos.org/opencl/.
The Khronos Group. 2017b. OpenGL. Retrieved April 3, 2018, from https://www.opengl.org/.
Kashi Venkatesh Vishwanath and Nachiappan Nagappan. 2010. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC’10). ACM, New York, NY, 193–204.
William W. White. 2012. River Running Through It. Retrieved April 3, 2018, from https://www.cs.siue.edu/&sim;wwhite/SIGGRAPH/SIGGRAPH2012Itinerary.pdf.
Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. 2006. The potential of the cell processor for scientific computing. In Proceedings of the 3rd Conference on Computing Frontiers. ACM, New York, NY, 9–20.
John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Commun. ACM 17, 9, 530–531.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’16). 2.
Gengbin Zheng, Abhinav Bhatelé, Esteban Meneses, and Laxmikant V. Kalé. 2011. Periodic hierarchical load balancing for large supercomputers. Int. J. High Perform. Comput. Appl. 25, 4, 371–385.
Yongning Zhu and Robert Bridson. 2005. Animating sand as a fluid. ACM Trans. Graphics 24, 965–972.