    In this talk we present MisterD, Pixar’s render farm monitoring system. Metrics for networking, filesystem I/O, and rendering nodes, as well as the entire batch rendering job queue, are stored in a relational database. A salient feature of the system is a powerful custom query language that lets users succinctly express complex relational queries without some of the encumbrances of SQL syntax. We show how a number of common hardware and software failures can be efficiently detected and addressed, and how some uncommon pathologies have been discovered. The system is designed to scale to very large render farms and currently supports a farm with tens of thousands of cores. The same techniques apply equally to smaller farms.


