“The sketchy database: learning to retrieve badly drawn bunnies”







    We present the Sketchy database, the first large-scale collection of sketch-photo pairs. We ask crowd workers to sketch particular photographic objects sampled from 125 categories and acquire 75,471 sketches of 12,500 objects. The Sketchy database gives us fine-grained associations between particular photos and sketches, and we use these associations to train cross-domain convolutional networks that embed sketches and photographs in a common feature space. We use our database as a benchmark for fine-grained retrieval and show that our learned representation significantly outperforms both hand-crafted features and deep features trained for sketch or photo classification. Beyond image retrieval, we believe the Sketchy database opens up new opportunities for sketch and image understanding and synthesis.
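
    To make the retrieval setting concrete, here is a minimal sketch of how fine-grained retrieval in a common feature space could be scored. It assumes the networks have already mapped every sketch and photo to a vector. The Euclidean distance, the recall-at-K measure, the function names, and the toy 2-D vectors are all illustrative assumptions, not details taken from the abstract.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_photos(sketch_vec, photo_vecs):
    """Return photo indices ordered from nearest to farthest from the sketch."""
    return sorted(range(len(photo_vecs)),
                  key=lambda i: euclidean(sketch_vec, photo_vecs[i]))

def recall_at_k(sketch_vecs, photo_vecs, paired_photo, k):
    """Fraction of sketches whose paired photo appears among the top k results."""
    hits = sum(
        1
        for vec, target in zip(sketch_vecs, paired_photo)
        if target in rank_photos(vec, photo_vecs)[:k]
    )
    return hits / len(sketch_vecs)

# Hypothetical, hand-made embeddings standing in for network outputs.
photos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sketches = [[0.1, 0.0], [0.9, 0.1], [0.3, 0.4]]
paired_photo = [0, 1, 2]  # sketch i was drawn from photo paired_photo[i]

recall_1 = recall_at_k(sketches, photos, paired_photo, k=1)
recall_2 = recall_at_k(sketches, photos, paired_photo, k=2)
```

    In this toy data the third sketch lies closer to photo 0 than to its own photo, so it counts as a miss at K = 1 and a hit at K = 2. Fine-grained retrieval asks for the particular photo that was sketched, not just any photo of the same category, which is why the score is computed against the paired photo index.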


