ROBiT - A Binary Optimization Anti-Plagiarism Method

Roberta Robert (UFRGS); Bruno Castro da Silva (University of Massachusetts); Jeferson Campos Nobre (UFRGS)

DOI: https://doi.org/10.5753/sbseg.2025.11514

Abstract

The widespread availability of Large Language Models (LLMs) has significantly lowered the barrier to committing code plagiarism. However, most existing anti-plagiarism tools remain vulnerable to modern evasion strategies, including syntactic transformations and generative code rewriting. Prior work shows that such transformations can effectively bypass clone detectors that rely on syntactic or semantic representations. While binary optimization is a known technique in malware obfuscation, its potential for plagiarism detection has been largely overlooked. We introduce a hybrid detection method that combines source-level syntactic analysis with binary-level comparison, leveraging both standard compilation outputs and binaries generated with optimization flags. These optimizations act as a reverse filter, eliminating syntactic manipulations added to code artifacts and revealing structural similarities with the original binary. Our empirical evaluation confirms that optimized binaries exhibit patterns that correlate strongly with their original source code. The proposed method demonstrates high effectiveness in detecting plagiarism, even when the source code has undergone aggressive syntactic transformations. This technique serves as a robust and complementary extension to existing syntactic anti-plagiarism systems, offering deeper insight into semantic and structural code similarity.
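The binary-level half of the idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the compiler, the optimization flags, or the similarity metric, so the use of `gcc`, the flags `-O0` and `-O2`, the normalized edit distance over raw object-file bytes, and every function name below are assumptions made for illustration only.

```python
import subprocess
import tempfile
from pathlib import Path


def levenshtein(a: bytes, b: bytes) -> int:
    """Edit distance between two byte strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, byte_a in enumerate(a, start=1):
        current = [i]
        for j, byte_b in enumerate(b, start=1):
            cost = 0 if byte_a == byte_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: bytes, b: bytes) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical byte strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def compile_object(source: Path, opt_flag: str, out_dir: Path) -> bytes:
    """Compile one C file to an object file with gcc and return its bytes.

    Assumption: gcc is on PATH; the paper's actual toolchain is not stated here.
    """
    obj = out_dir / f"{source.stem}_{opt_flag.lstrip('-')}.o"
    subprocess.run(["gcc", opt_flag, "-c", str(source), "-o", str(obj)], check=True)
    return obj.read_bytes()


def compare_sources(original: Path, suspect: Path, flags=("-O0", "-O2")) -> dict:
    """Return one similarity score per optimization level (hypothetical helper)."""
    scores = {}
    with tempfile.TemporaryDirectory() as tmp:
        for flag in flags:
            a = compile_object(original, flag, Path(tmp))
            b = compile_object(suspect, flag, Path(tmp))
            scores[flag] = similarity(a, b)
    return scores
```

Calling `compare_sources(Path("original.c"), Path("suspect.c"))` with two hypothetical submissions would return one score per flag. If the optimizer strips away the syntactic padding added to a copied program, the `-O2` score would be expected to exceed the `-O0` score. The pure-Python edit distance is quadratic in input size, so this sketch is only practical for small object files.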
