Fusing Scene Context to Improve Object Recognition

Authors

  • Leandro P. da Silva, Pontifícia Universidade Católica do Rio Grande do Sul
  • Roger Granada, Pontifícia Universidade Católica do Rio Grande do Sul
  • Juarez Monteiro, Pontifícia Universidade Católica do Rio Grande do Sul
  • Duncan D. Ruiz, Pontifícia Universidade Católica do Rio Grande do Sul

DOI:

https://doi.org/10.5753/jidm.2018.2050

Keywords:

convolutional neural networks, neural networks, object recognition

Abstract

Computer vision is a branch of science that seeks to give computers the capability of seeing the world around them. Among its tasks, object recognition aims to classify objects and to identify where each object is located in a given image. Since objects tend to occur in particular environments, their contextual association can be useful for improving object recognition. To exploit this contextual awareness, our approach uses the context of the scene to achieve higher-quality object recognition by fusing context information with object detection features. We propose a novel architecture composed of two convolutional neural networks based on two well-known pre-trained networks: Places365-CNN and Faster R-CNN. Our two-stream architecture concatenates object features with scene context features in a late fusion approach. We performed experiments on public datasets (PASCAL VOC 2007, MS COCO, and a subset of SUN09), analyzing the performance of our architecture under different threshold scores. Results show that our approach raises the scores of in-context objects and reduces the scores of out-of-context objects.
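The late-fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the feature dimensions (a 4096-d object feature vector and 365 scene-category scores), the randomly initialized rescoring head, and all variable names are assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-detection feature vector from the object stream
# (e.g., Faster R-CNN fc features) and a scene-context vector for the
# whole image (e.g., Places365-CNN scene-category scores).
obj_feat = rng.random(4096)     # one detected object
scene_feat = rng.random(365)    # scene scores for the image

# Late fusion: concatenate both streams into a single vector ...
fused = np.concatenate([obj_feat, scene_feat])

# ... then rescore the detection with a linear classifier over the
# fused representation (weights are random here, for illustration only).
num_classes = 20                # e.g., the PASCAL VOC object classes
W = rng.standard_normal((num_classes, fused.size)) * 0.01
scores = W @ fused
probs = np.exp(scores - scores.max())
probs /= probs.sum()            # softmax over the object classes
```

In this scheme the scene stream can only re-weight the detector's output; in-context classes receive higher fused scores while out-of-context classes are suppressed, which matches the behavior the abstract reports.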


References

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. Speeded-up robust features (surf). Computer Vision and Image Understanding 110 (3): 346–359, 2008.

Bell, S., Lawrence Zitnick, C., Bala, K., and Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR’16. pp. 2874–2883, 2016.

Biederman, I., Mezzanotte, R. J., and Rabinowitz, J. C. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology 14 (2): 143–177, 1982.

Chu, W. and Cai, D. Deep feature based contextual model for object detection. Neurocomputing 275 (31): 1035–1042, 2017.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR’09. pp. 248–255, 2009.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2): 303–338, 2010.

Galleguillos, C. and Belongie, S. Context based object categorization: A critical survey. Computer Vision and Image Understanding 114 (6), 2010.

Girshick, R. Fast r-cnn. In ICCV’15. pp. 1440–1448, 2015.

Guo, S., Huang, W., Wang, L., and Qiao, Y. Locally supervised deep hybrid model for scene recognition. IEEE Transactions on Image Processing 26 (2): 808–820, 2017.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation 9 (8): 1735–1780, 1997.

Choi, M. J., Lim, J. J., Torralba, A., and Willsky, A. S. Exploiting hierarchical context on a large database of object categories. In CVPR’10. pp. 129–136, 2010.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS’12. pp. 1097–1105, 2012.

Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR’06. pp. 2169–2178, 2006.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521 (7553): 436–444, 2015.

Li, J., Wei, Y., Liang, X., Dong, J., Xu, T., Feng, J., and Yan, S. Attentive contexts for object detection. IEEE Transactions on Multimedia 19 (5): 944–954, 2017.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV’14. pp. 740–755, 2014.

Liu, J., Gao, C., Meng, D., and Zuo, W. Two-stream contextualized cnn for fine-grained image classification. In AAAI’16. pp. 4232–4233, 2016.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. Ssd: Single shot multibox detector. In ECCV’16. pp. 21–37, 2016.

Lowe, D. G. Object recognition from local scale-invariant features. In ICCV’99. pp. 1150–1157, 1999.

Oliva, A. and Torralba, A. The role of context in object recognition. Trends in Cognitive Sciences 11 (12): 520–527, 2007.

Ouyang, W., Wang, X., Zeng, X., Qiu, S., et al. Deepid-net: Deformable deep convolutional neural networks for object detection. In CVPR’15. pp. 2403–2412, 2015.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–1359, 2010.

Perronnin, F. Universal and adapted vocabularies for generic visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (7): 1243–1256, 2008.

Quattoni, A. and Torralba, A. Recognizing indoor scenes. In CVPR’09. pp. 413–420, 2009.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection. In CVPR’16. pp. 779–788, 2016.

Redmon, J. and Farhadi, A. Yolo9000: Better, faster, stronger. In CVPR’17. pp. 6517–6525, 2017.

Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS’15. pp. 91–99, 2015.

Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In NIPS’14. pp. 568–576, 2014a.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 , 2014b.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR’15. pp. 1–9, 2015.

Wang, L., Guo, S., Huang, W., Xiong, Y., and Qiao, Y. Knowledge guided disambiguation for large-scale scene classification with multi-resolution cnns. IEEE Transactions on Image Processing 26 (4): 2055–2068, 2017.

Wang, Z., Wang, L., Wang, Y., Zhang, B., and Qiao, Y. Weakly supervised patchnets: Describing and aggregating local patches for scene recognition. IEEE Transactions on Image Processing 26 (4): 2028–2041, 2017.

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR’10. pp. 3485–3492, 2010.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV’14. pp. 818–833, 2014.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99): 1–14, 2017.

Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. Learning deep features for scene recognition using places database. In NIPS’14. pp. 487–495, 2014.

Published

2018-10-01

How to Cite

da Silva, L. P., Granada, R., Monteiro, J., & Ruiz, D. D. (2018). Fusing Scene Context to Improve Object Recognition. Journal of Information and Data Management, 9(2), 147. https://doi.org/10.5753/jidm.2018.2050

Section

KDMILE 2017