Spider Species Classification Using Vision Transformers and Convolutional Neural Networks
Abstract
Spiders often seek shelter in the warmth and safety of homes, and although most of them are harmless, some pose a real danger. Since differentiating spider species can be challenging for people without prior knowledge, a method to identify them could help avoid potentially venomous ones. To address this problem, this project analyzed and compared the quantitative and qualitative performance of convolutional neural networks (CNNs) and vision transformers (ViTs) in the task of classifying spider species from images. We used a publicly available dataset of around 25,000 images covering 25 Brazilian spider species. Models were selected based on their metrics and generalization performance on this classification task. Preliminary results indicated that ConvNeXt was the most proficient of the examined convolutional neural networks, achieving a macro accuracy of 88.5%. Among the vision transformers, MaxViT surpassed its counterparts with a macro accuracy of 90.1%, and also outperformed all the models in a direct comparison of performance metrics. These results may contribute to the development of applications for identifying spiders and providing relevant information about each species.
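The abstract reports results as macro accuracy, which treats every species equally regardless of how many images it has. As a minimal sketch (the function name and the toy labels below are illustrative, not from the paper), the metric can be computed as the mean of the per-class accuracies:

```python
from collections import defaultdict

def macro_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so rare species weigh
    as much as common ones in the final score."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example with 3 "species": class 0 fully correct,
# class 1 half correct, class 2 entirely wrong.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 2, 0]
print(macro_accuracy(y_true, y_pred))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

On a class-imbalanced dataset such as field photographs of 25 species, this macro average is a stricter measure than plain accuracy, since a model cannot score well by only recognizing the most photographed species.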
References
Secretaria de Vigilância em Saúde, “Panorama dos acidentes causados por aranhas no Brasil, de 2017 a 2021,” 2022, [link].
Y. S. Abu-Mostafa, H.-T. Lin, and M. Magdon-Ismail, Learning From Data. AMLBook, 2012.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, [link].
Z. Li, W. Yang, S. Peng, and F. Liu, “A survey of convolutional neural networks: Analysis, applications, and prospects,” 2020.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR. OpenReview.net, 2021.
Y. Jian, S. Peng, L. Zhenpeng, Z. Yu, Z. Chenggui, and Y. Zizhong, “Automatic classification of spider images in natural background,” in 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), 2019, pp. 158–164.
R. O. Sinnott, D. Yang, X. Ding, and Z. Ye, “Poisonous spider recognition through deep learning,” in Proceedings of the Australasian Computer Science Week Multiconference, ser. ACSW ’20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3373017.3373031
Q. Chen, Y. Ding, C. Liu, J. Liu, and T. He, “Research on spider sex recognition from images based on deep learning,” IEEE Access, vol. 9, pp. 120985–120995, 2021.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015.
S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” CoRR, vol. abs/1611.05431, 2016. [Online]. Available: [link]
Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” 2022.
Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo, “Swin transformer v2: Scaling up capacity and resolution,” 2022.
Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxvit: Multi-axis vision transformer,” 2022.
J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” CoRR, vol. abs/1411.1792, 2014. [Online]. Available: [link]
J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, 2012.