Leveraging Sample-Specific Strings to Enhance Fusion Transcript Detection
Abstract
Fusion transcripts are widely used biomarkers (e.g., in cancer diagnosis), but current methods for detecting them from long-read sequencing data still yield a high number of false positives. In this study, we address this problem by selecting, before fusion detection is applied, only the reads that contain strings absent from a reference transcriptome, known as sample-specific strings (SFSs). We adapted an existing SFS retrieval algorithm to long-read RNA-seq data and applied it to select reads in eight simulated datasets. The selected reads were then given as input to three fusion transcript detection methods: LongGF, JAFFAL, and CTAT-LR. Our results show that SFSs capture the reads carrying the relevant information and, hence, improve the accuracy of most of these fusion detection methods.
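To illustrate the read-selection idea, the following minimal sketch keeps only the reads that contain at least one string absent from a toy reference transcriptome. It is not the adapted SFS retrieval algorithm used in the study: it uses naive substring search rather than an index, ignores reverse complements and sequencing errors, and the function names `minimal_absent_substrings` and `select_reads` are hypothetical, introduced here only for illustration.

```python
def minimal_absent_substrings(read, references):
    """Return (start, substring) pairs for substrings of `read` that occur in
    no reference sequence while both their longest proper prefix and longest
    proper suffix do occur (a naive, quadratic-time illustration)."""

    def present(s):
        # Plain substring search; a real tool would query an index instead.
        return any(s in ref for ref in references)

    found = []
    n = len(read)
    for start in range(n):
        for end in range(start + 1, n + 1):
            candidate = read[start:end]
            if not present(candidate):
                # The prefix candidate[:-1] is present by construction;
                # also require the suffix so the string cannot be shortened.
                if len(candidate) == 1 or present(candidate[1:]):
                    found.append((start, candidate))
                break  # longer extensions from this start are not minimal
    return found


def select_reads(reads, references):
    """Keep only the reads that contain at least one absent string."""
    return {
        name: seq
        for name, seq in reads.items()
        if minimal_absent_substrings(seq, references)
    }


if __name__ == "__main__":
    # Toy data (assumed for illustration): two reference transcripts and two reads.
    references = ["ACGTACGT", "TTGGCCAA"]
    reads = {
        "r1": "ACGTAC",    # fully contained in the first transcript
        "r2": "ACGTTTGG",  # joins a prefix of one transcript to a prefix of the other
    }
    print(minimal_absent_substrings(reads["r2"], references))
    print(sorted(select_reads(reads, references)))
```

In this toy case only the chimeric read `r2` is retained, since every substring of `r1` already occurs in the reference. On real long reads, sequencing errors would also produce absent strings, so a practical method needs additional handling that this sketch does not attempt.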
