Publication

Machine learning approaches to code similarity measurement: A systematic review

Saber, Takfarinas
Zhang, Zixian
Citation
Zhang, Z., & Saber, T. (2025). Machine Learning Approaches to Code Similarity Measurement: A Systematic Review. IEEE Access, 13, 51729-51764. https://dx.doi.org/10.1109/ACCESS.2025.3553392
Abstract
Source code similarity measurement, which involves assessing the degree of difference between code segments, plays a crucial role in various aspects of the software development cycle. These include but are not limited to code quality assurance, code review processes, code plagiarism detection, security, and vulnerability analysis. Despite the increasing application of machine learning (ML) techniques in this domain, a comprehensive synthesis of existing methodologies remains lacking. This paper presents a systematic review of ML techniques applied to code similarity measurement, aiming to illuminate current methodologies and contribute valuable insights to the research community. Following a rigorous systematic review protocol, we identified and analyzed 84 primary studies across a broad spectrum of dimensions covering application type, devised ML algorithms, code representations used, datasets, and performance metrics, as well as performance evaluations. A deep investigation reveals that 15 applications of code similarity measurement have utilized 51 different ML algorithms. Additionally, the most prevalent code representation is found to be the abstract syntax tree (AST). Furthermore, the most frequently employed dataset across various code similarity research applications is BigCloneBench. Through this comprehensive analysis, the paper not only synthesizes existing research but also identifies prevailing limitations and challenges, shedding light on potential avenues for future work.
Publisher
Institute of Electrical and Electronics Engineers
Publisher DOI
10.1109/ACCESS.2025.3553392
Rights
Attribution 4.0 International