Positive and Unlabeled Learning for Detecting Software Functional Clones with Adversarial Training

Hui-Hui Wei, Ming Li

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 2840-2846. https://doi.org/10.24963/ijcai.2018/394

Software clone detection is an important problem in software maintenance and evolution and has attracted considerable attention. However, existing approaches ignore the fact that people label a pair of code fragments as "clone" only if they happen to discover it, leaving a huge number of undiscovered clone pairs and non-clone pairs unlabeled. In this paper, we argue that real-world clone detection should be formalized as a Positive-Unlabeled (PU) learning problem. We address this problem by proposing a novel positive and unlabeled learning approach, namely CDPU, to effectively detect software functional clones, i.e., pieces of code with similar functionality that differ at both the syntactic and lexical levels. Adversarial training is employed to improve the robustness of the learned model to non-clone pairs that look extremely similar but behave differently. Experiments on software clone detection benchmarks indicate that the proposed approach, together with adversarial training, outperforms state-of-the-art approaches for software functional clone detection.
Keywords:
Machine Learning: Semi-Supervised Learning
Multidisciplinary Topics and Applications: Knowledge-based Software Engineering