“Rigel: flexible multi-rate image processing hardware”

  • ©James Hegarty, Zachary DeVito, Jonathan Ragan-Kelley, Patrick (Pat) Hanrahan, Ross Daly, and Mark Horowitz

Session/Category Title: OPTIMIZING IMAGE PROCESSING


Abstract:


    Image processing algorithms implemented using custom hardware or FPGAs can be orders of magnitude more energy efficient and performant than software. Unfortunately, converting an algorithm by hand to a hardware description language suitable for compilation on these platforms is frequently too time-consuming to be practical. Recent work on hardware synthesis of high-level image processing languages demonstrated that a single-rate pipeline of stencil kernels can be synthesized into hardware with provably minimal buffering. Unfortunately, few advanced image processing or vision algorithms fit into this highly restricted programming model.

    In this paper, we present Rigel, which takes pipelines specified in our new multi-rate architecture and lowers them to FPGA implementations. Our flexible multi-rate architecture supports pyramid image processing, sparse computations, and space-time implementation tradeoffs. We demonstrate depth from stereo, Lucas-Kanade, the SIFT descriptor, and a Gaussian pyramid running on two FPGA boards. Our system can synthesize hardware for FPGAs with up to 436 megapixels/second throughput and up to 297x faster runtime than a tablet-class ARM CPU.


