Systematic Review of Guiding Theories for Visual Transformers

Abstract

Transformers have emerged as a powerful architecture in Artificial Intelligence, revolutionizing a wide range of Natural Language Processing and image processing tasks. This paper presents a comprehensive analysis of the historical evolution of Transformers, emphasizing their self-attention mechanism and culminating in the introduction of the Visual Transformers model. We examine the main contributions of the key works that led to the development of Visual Transformers. This systematic analysis supports a deeper understanding of how the model works and identifies the central topics of each approach.
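As a minimal sketch of the mechanism the review centers on, the NumPy snippet below applies scaled dot-product self-attention (Vaswani et al., 2017) to flattened 16x16 image patches, mirroring the patch-tokenization idea behind Visual Transformers (Dosovitskiy et al., 2020). All array shapes, weight initializations, and variable names here are illustrative assumptions, not code from the reviewed works; a real model would use learned embeddings, multiple attention heads, and a prepended class token.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention:
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)

# Toy 32x32 "image" split into non-overlapping 16x16 patches,
# each flattened into one token (ViT-style tokenization).
image = rng.standard_normal((32, 32))
P = 16
n = 32 // P
patches = image.reshape(n, P, n, P).swapaxes(1, 2).reshape(n * n, P * P)

# Linear patch embedding plus a (random, stand-in) positional term;
# in an actual ViT both are learned parameters.
d_model = 64
W_embed = rng.standard_normal((P * P, d_model)) * 0.02
pos = rng.standard_normal((n * n, d_model)) * 0.02
tokens = patches @ W_embed + pos

Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (4, 64): one attended vector per patch token
```

The division by sqrt(d_k) is the scaling from Vaswani et al. (2017): it keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishing gradients.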

Keywords: self-attention, transformers, NLP, visual

Published: 2023-10-09

S. JUNIOR, Joelson; LUCCA, Giancarlo; BOTTERO, Diego; DIMURO, Graçaliz P.; SANTOS, Helida. Systematic Review of Guiding Theories for Visual Transformers. In: WORKSHOP-SCHOOL ON THEORETICAL COMPUTER SCIENCE (WEIT), 7., 2023, Rio Grande/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 87-94. DOI: https://doi.org/10.5753/weit.2023.26601.