Revisiting the Evaluation of Deep Learning-Based Compiler Testing

Yongqiang Tian; Zhenyang Xu; Yiwen Dong; Chengnian Sun; Shing-Chi Cheung

doi:10.24963/ijcai.2023/542

Revisiting the Evaluation of Deep Learning-Based Compiler Testing

Yongqiang Tian, Zhenyang Xu, Yiwen Dong, Chengnian Sun, Shing-Chi Cheung

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Main Track. Pages 4873-4882. https://doi.org/10.24963/ijcai.2023/542

PDF BibTeX

A high-quality program generator is essential to effective automated compiler testing. Engineering such a program generator is difficult, time-consuming, and specific to the language under testing, thus requiring tremendous efforts from human experts with language-specific domain knowledge. To avoid repeatedly writing program generators for different languages, researchers recently proposed a language-agnostic approach based on deep learning techniques to automatically learn a program generator (referred to as DLG) from existing programs. Evaluations show that DLGs outperform Language-Specific Program Generators (LSGs) in testing compilers. However, we argue that it is unfair to use LSGs as baselines to evaluate DLGs. LSGs aim to validate compiler optimizations by only generating compilable, well-defined test programs; this restriction inevitably impairs the diversity of the language features used in the generated programs. In contrast, DLGs do not aim to validate the correctness of compiler optimizations, and its generated programs are not guaranteed to be well-defined or even compilable. Therefore, it is not surprising that DLG-generated programs are more diverse in terms of used language features than LSG-generated ones. This study revisits the evaluation of DLGs, and proposes a new, fair, simple yet strong baseline named Kitten for evaluating DLGs. Given a dataset consisting of human-written programs, instead of using deep learning techniques to learn a program generator, Kitten directly derives new programs by mutating the programs in the dataset. Extensive experiments with more than 1,500 CPU-hours demonstrate that the state-of-the-art DLGs fail to compete against such a simple baseline: 3 v.s. 1,750 hang bugs, 1 v.s. 34 distinct compiler crashes. We believe that DLGs still have a large room for improvement.

Keywords:

Multidisciplinary Topics and Applications: MDA: Software engineering