DNApp: Challenges in Maintaining Host Integrity via Instruction-Based Application Characterization
Abstract
In this paper, we introduce DNApp, a method for identifying privileged Linux programs through syntactic analysis of their assembly instructions. We apply TF-IDF to vectorize opcode n-grams (bigrams, trigrams, and 4-grams) extracted from the executable files of five Ubuntu versions. We then evaluate the averaged vectors using k-means and the Silhouette coefficient, showing that smaller samples are better distinguished by 4-grams with 128 to 256 dimensions, whereas bigrams with 512 dimensions work better for larger sets. Although we found consistent clustering, the method has limitations (e.g., overlap and undersized vectors). Overall, DNApp shows promise for identifying malicious changes in privileged binaries without relying on static signatures.
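The pipeline summarized in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The opcode sequences, the function names, and the `max_features` cut-off (used here as a stand-in for the vector dimension) are assumptions made for illustration. The k-means step and the averaging of vectors are omitted, so cluster labels are supplied by hand and only the TF-IDF vectorization of opcode n-grams and the Silhouette coefficient are shown.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return all n-grams (as tuples) of an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def tfidf_vectors(docs, n, max_features):
    """TF-IDF over opcode n-grams, keeping the most frequent n-grams as dimensions."""
    counts = [Counter(opcode_ngrams(d, n)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    # Assumption: the vocabulary is truncated by corpus-wide frequency.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [gram for gram, _ in ranked[:max_features]]
    # Smoothed inverse document frequency.
    idf = {
        g: math.log((1 + len(docs)) / (1 + sum(1 for c in counts if g in c))) + 1
        for g in vocab
    }
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        vectors.append([c[g] / length * idf[g] for g in vocab])
    return vocab, vectors


def silhouette(vectors, labels):
    """Mean Silhouette coefficient (Euclidean distance); needs at least two clusters."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    scores = []
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        same = [dist(v, w) for j, (w, l) in enumerate(zip(vectors, labels))
                if l == lab and j != i]
        if not same:  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance inside the own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(v, w) for w, l in zip(vectors, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical opcode sequences: two builds each of two made-up programs.
docs = [
    ["push", "mov", "sub", "call", "mov", "pop", "ret"],
    ["push", "mov", "sub", "call", "mov", "call", "pop", "ret"],
    ["xor", "add", "cmp", "jne", "add", "cmp", "jne", "ret"],
    ["xor", "add", "cmp", "jne", "xor", "add", "cmp", "jne", "ret"],
]
vocab, vectors = tfidf_vectors(docs, n=2, max_features=8)
good = silhouette(vectors, [0, 0, 1, 1])  # builds grouped by program
bad = silhouette(vectors, [0, 1, 0, 1])   # builds grouped incorrectly
print(f"grouped by program: {good:.3f}, mixed: {bad:.3f}")
```

In this toy setting, grouping the builds by program should yield a higher Silhouette coefficient than mixing them, which is the kind of separation the evaluation looks for when comparing n-gram sizes and vector dimensions.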
Published
2025-09-01
How to Cite
SILVA, Felipe Duarte; ALVES, Marco Zanata; ALMEIDA, Paulo Lisboa de; GRÉGIO, André. DNApp: Challenges in Maintaining Host Integrity via Instruction-Based Application Characterization. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1027-1034. DOI: https://doi.org/10.5753/sbseg.2025.11435.
