Make No Mistake! Why Do Tools Make Incorrect Long Non-coding RNA Classification?

Alisson G. Chiquitto (UTFPR / IFMS); Lucas Otávio L. Silva (UTFPR); Liliane Santana Oliveira (UTFPR); Douglas S. Domingues (UTFPR / USP); Alexandre R. Paschoal (UTFPR)

Abstract

Long non-coding RNAs (lncRNAs) play important roles in various biological processes, and their accurate identification is essential for understanding their functions and potential therapeutic applications. In a previous study, we assessed the impact of short-read and long-read sequencing technologies on the computational identification of lncRNAs in human and plant data, and provided evidence of where and how lncRNA classification approaches could be improved. In this follow-up study, we investigate the human sequences misclassified by five machine learning tools for lncRNA classification, in order to understand why the tools fail. Our analysis suggests that the primary cause of these failures is lncRNAs that overlap two coding regions, resembling chimeric sequences. Furthermore, we emphasize the need to view genes as transcriptional units, since it is the transcript that defines the gene function. These insights underscore the need to refine these tools to improve their accuracy and reliability in lncRNA prediction and classification, ultimately contributing to a better understanding of the role of lncRNAs in biological processes and of their potential therapeutic applications.

Keywords: Non-coding RNAs, high-throughput sequencing technologies, coding, methods, benchmarking
