Identification of DNA Coding Regions Using Transformers
Abstract
Identifying coding (exon) and non-coding (intron) regions in DNA sequences is fundamental to understanding gene expression and its implications for biological processes and genetic diseases. In this work, we investigate the application of Transformer-based architectures to intron and exon classification, comparing three models: GPT-2, BERT, and DNABERT. These models were selected to evaluate the impact of three context modeling strategies (autoregressive, bidirectional, and k-mer-based) on genomic sequence analysis. Experiments were carried out on a curated dataset of 100,000 training sequences and 30,000 test sequences, with no sequence shared between the two sets, so that test accuracy reflects unseen data. All models were fine-tuned under uniform conditions, with a fixed batch size of 32 and the same learning rate constraints, and each was run three times with different seeds. BERT achieved the highest classification accuracy (0.9905), followed by GPT-2 (0.9867) and DNABERT (0.9569). DNABERT was the fastest to train, owing to its k-mer tokenization and lighter computational requirements, but its limited capacity to model long-range dependencies impaired its accuracy. GPT-2, in contrast, reached competitive accuracy at a higher computational cost, reinforcing the trade-off between generative modeling power and efficiency. This study highlights the importance of context-aware attention mechanisms in genomic sequence modeling and confirms the viability of Transformer architectures, especially bidirectional models such as BERT, for high-accuracy classification of intronic and exonic regions. Future work may benefit from exploring larger models, alternative sequence representations, and training optimization techniques to further improve performance in genomics applications.
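As an illustration of the k-mer-based representation mentioned above, the following minimal Python sketch splits a DNA sequence into overlapping k-mers with a stride of one, the scheme used by the original DNABERT. The function name `kmer_tokenize` and the default `k = 6` are assumptions made for this example; the abstract does not state which value of k was used in the experiments.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration; k = 6 is an assumed default.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty token list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    # A 10-nucleotide sequence produces 10 - 3 + 1 = 8 overlapping 3-mers.
    print(kmer_tokenize("ATGGCCATTG", k=3))
```

A sequence of length n becomes n - k + 1 tokens, and each token covers k adjacent nucleotides, so neighboring tokens share k - 1 characters.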
