A Shuffle-Based Statistical Approach for Robust Pseudogene Annotation

Pedro M. Barcelos; Marcos Catanho; Antônio B. de Miranda; Edward H. Haeusler; Sérgio Lifschitz

doi:10.5753/bsb.2025.15172

Pedro M. Barcelos, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)
Marcos Catanho, Fundação Oswaldo Cruz (Fiocruz)
Antônio B. de Miranda, Fundação Oswaldo Cruz (Fiocruz)
Edward H. Haeusler, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)
Sérgio Lifschitz, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)

Abstract

The accurate annotation of pseudogenes is a significant challenge in genomics, as their decaying sequences often fall into a "twilight zone" of similarity that confounds automated methods. This paper describes a robust, homology-based methodology designed to overcome this issue. The core of the approach is a shuffle-based statistical evaluation used to establish a custom, empirically derived significance threshold. This threshold allows true, biologically significant sequence remnants to be discriminated with confidence from stochastic background noise, providing a reliable framework for annotating pseudogenes and unannotated coding sequences in large-scale genomic projects.
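
To make the idea of a shuffle-based threshold concrete, the following is a minimal sketch and not the authors' pipeline, whose tools and parameters the abstract does not specify. It assumes a simple Smith-Waterman local alignment score with a linear gap penalty, a null distribution built by shuffling the query (which preserves its residue composition), and a cutoff taken at a chosen quantile of the shuffled scores. All function names, the scoring values, the number of shuffles, the quantile, and the example sequences are hypothetical choices made for illustration.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled_null_scores(query, target, n_shuffles=200, seed=0):
    """Score composition-preserving shuffles of the query against the target."""
    rng = random.Random(seed)
    residues = list(query)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(local_alignment_score("".join(residues), target))
    return scores


def empirical_threshold(null_scores, quantile=0.99):
    """Return the score at the given quantile of the shuffled (null) scores."""
    ordered = sorted(null_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_significant(query, target, n_shuffles=200, quantile=0.99, seed=0):
    """Flag a hit only if the real score exceeds the empirical threshold."""
    null = shuffled_null_scores(query, target, n_shuffles, seed)
    return local_alignment_score(query, target) > empirical_threshold(null, quantile)


if __name__ == "__main__":
    # Hypothetical protein fragment embedded in a longer hypothetical target.
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    target = "GGSGG" + query + "PPLPP"
    null = shuffled_null_scores(query, target)
    print("real score:", local_alignment_score(query, target))
    print("empirical threshold:", empirical_threshold(null))
    print("significant:", is_significant(query, target))
```

Because the cutoff is derived from shuffles of the sequence being tested, it adapts to that sequence's length and composition instead of relying on a single fixed score or E-value. In practice the alignment step would be performed by a dedicated search tool, and the quantile would be chosen to match the false-positive rate the annotation project can tolerate.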

Keywords: Pseudogene Annotation
