Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code

Huihui Wei, Ming Li

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Main track. Pages 3034-3040. https://doi.org/10.24963/ijcai.2017/423

Software clone detection, which aims to identify code fragments with similar functionality, plays an important role in software maintenance and evolution. Many clone detection approaches have been proposed. However, most of them represent source code either with hand-crafted features based on lexical or syntactical information, or with unsupervised deep features, which makes it difficult to detect functional clone pairs, i.e., pieces of code with similar functionality that differ at both the syntactical and lexical levels. In this paper, we address the software functional clone detection problem by learning supervised deep features. We formulate clone detection as a supervised learning-to-hash problem and propose an end-to-end deep feature learning framework called CDLH for functional clone detection. The framework learns hash codes by exploiting lexical and syntactical information, enabling fast computation of functional similarity between code fragments. Experiments on software clone detection benchmarks indicate that the CDLH approach is effective and outperforms state-of-the-art approaches in software functional clone detection.
Keywords:
Machine Learning: Data Mining
Machine Learning: Machine Learning
Multidisciplinary Topics and Applications: Knowledge-based Software Engineering
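
The abstract notes that hash codes allow fast computation of functional similarity between code fragments. As a minimal sketch of why binary codes make comparison cheap, the snippet below compares fixed-length hash codes by Hamming distance. It assumes Hamming distance as the similarity measure and an arbitrary threshold; the abstract states neither. The function names, the 8-bit codes, and the threshold are hypothetical and are not taken from CDLH, and the learned model that would produce the codes is not shown.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions in which two binary hash codes differ."""
    # XOR leaves a 1 wherever the two codes disagree; count those bits.
    return bin(code_a ^ code_b).count("1")


def is_clone_pair(code_a: int, code_b: int, threshold: int) -> bool:
    """Flag two fragments as a candidate clone pair (hypothetical rule)."""
    # Assumption: fragments whose codes differ in at most `threshold`
    # bits are treated as functionally similar.
    return hamming_distance(code_a, code_b) <= threshold


# Hypothetical 8-bit hash codes for three code fragments.
fragment_a = 0b10110010
fragment_b = 0b10110110  # differs from fragment_a in one bit
fragment_c = 0b01001101  # differs from fragment_a in every bit

print(hamming_distance(fragment_a, fragment_b))
print(is_clone_pair(fragment_a, fragment_b, threshold=2))
print(is_clone_pair(fragment_a, fragment_c, threshold=2))
```

Because each comparison reduces to one XOR and a bit count, the cost per pair is independent of the length of the original source fragments.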