Multimodal classification for detecting prohibited products on a marketplace platform

  • Alan da Silva Romualdo (UFSCar)
  • Livy Real (Americanas S.A.)
  • Helena de Medeiros Caseli (UFSCar)

Abstract


Multimodal learning aims to exploit the characteristics of several modalities (text, image, audio) to build computational models. In e-commerce, given the wide variety of product characteristics and the frequent absence or inconsistency of product information, combining information from different modalities is particularly well suited. This work presents experiments on the multimodal (text and image) classification of products (adult products) that may not be sold on the partner company's marketplace. In these experiments, neural networks were used to train unimodal and multimodal classifiers. The multimodal classifier reached an F1 score of 99%, against 98% for the text-only model and 94% for the image-only model.
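The abstract describes combining text and image information in a single classifier. As an illustration only (the paper's actual architecture and encoders are not detailed here), the sketch below shows feature-level fusion: synthetic stand-ins for modality-specific embeddings are concatenated and fed to a plain-NumPy logistic classifier. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for modality-specific embeddings; in a real system
# these would come from a text encoder and an image encoder.
n, d_text, d_img = 200, 16, 32
text_emb = rng.normal(size=(n, d_text))
img_emb = rng.normal(size=(n, d_img))
# Labels depend on both modalities, so neither alone suffices.
labels = ((text_emb[:, 0] + img_emb[:, 0]) > 0).astype(float)

# Early fusion: concatenate the modality embeddings into one vector.
fused = np.concatenate([text_emb, img_emb], axis=1)

def train_logreg(X, y, lr=0.5, epochs=300):
    """Logistic regression trained with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_logreg(fused, labels)
pred = (1.0 / (1.0 + np.exp(-(fused @ w + b))) > 0.5).astype(float)
accuracy = np.mean(pred == labels)
```

Since the labels here are a linear function of one feature from each modality, the fused classifier can separate them while either single-modality view would miss half the signal, which is the intuition behind the multimodal gains reported in the abstract.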

Published
29/11/2021
How to Cite

ROMUALDO, Alan da Silva; REAL, Livy; CASELI, Helena de Medeiros. Classificação multimodal para detecção de produtos proibidos em uma plataforma marketplace. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 13., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 111-120. DOI: https://doi.org/10.5753/stil.2021.17790.