SemFORMS: Automatic Generation of Semantic Transforms By Mining Data Science Code

Ibrahim Abdelaziz, Julian Dolby, Udayan Khurana, Horst Samulowitz, Kavitha Srinivas

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Demo Track. Pages 7106-7109. https://doi.org/10.24963/ijcai.2023/827

Careful choice of feature transformations in a dataset can improve predictive model performance, data understanding, and data exploration. However, finding useful features is a challenge, and while recent Automated Machine Learning (AutoML) systems provide some limited automation for feature engineering or data exploration, it is still mostly done by humans. We demonstrate a system called SemFORMS (Semantic Transforms), which attempts to mine useful expressions for a dataset from a repository of code that may target the same or a similar dataset. In many enterprises, numerous data scientists work on the same or similar datasets but are largely unaware of each other's work. SemFORMS finds appropriate code from such a repository and normalizes it into an actionable transform that can be prepended to any AutoML pipeline. We demonstrate SemFORMS operating over example datasets from the OpenML benchmarks, where it sometimes leads to significant improvements in AutoML performance.

Keywords:
Machine Learning: ML: Automated machine learning
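
To make the idea in the abstract concrete, the following is a minimal, standard-library-only sketch of mining a column-defining expression from existing code and turning it into a transform that runs before the rest of a pipeline. It is not the SemFORMS implementation, whose analysis is not described here. The snippet in `MINED_SOURCE`, the column names, and the helpers `mine_column_expressions`, `make_transform`, `drop_incomplete`, and `run_pipeline` are all hypothetical, and rows are modelled as plain dictionaries rather than a real dataframe. It assumes Python 3.9 or later (for `ast.unparse`).

```python
import ast

# Hypothetical script standing in for code found in a shared repository.
# It is only parsed, never executed, so pandas does not need to be installed.
MINED_SOURCE = """
import pandas as pd
df = pd.read_csv("houses.csv")
df["price_per_sqft"] = df["price"] / df["sqft"]
model.fit(df.drop(columns=["label"]), df["label"])
"""


def mine_column_expressions(source):
    """Collect statements of the form frame["new_col"] = <expr>.

    Returns a dict mapping the new column name to the expression source.
    """
    mined = {}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target = node.targets[0]
        if (isinstance(target, ast.Subscript)
                and isinstance(target.slice, ast.Constant)
                and isinstance(target.slice.value, str)):
            mined[target.slice.value] = ast.unparse(node.value)
    return mined


def make_transform(column, expression, frame_name="df"):
    """Normalize a mined expression into a reusable row-wise transform.

    Each row dict is bound to `frame_name`, so df["price"] reads a field.
    Assumption: the mined code is trusted; eval is unsafe on untrusted input.
    """
    code = compile(expression, "<mined>", "eval")

    def transform(rows):
        out = []
        for row in rows:
            new_row = dict(row)  # leave the input rows unmodified
            new_row[column] = eval(code, {"__builtins__": {}}, {frame_name: row})
            out.append(new_row)
        return out

    return transform


def drop_incomplete(rows):
    """Stand-in for an existing pipeline step: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def run_pipeline(steps, rows):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


mined = mine_column_expressions(MINED_SOURCE)
transforms = [make_transform(col, expr) for col, expr in mined.items()]

# Prepend the mined transforms to the existing (stand-in) pipeline steps.
pipeline = transforms + [drop_incomplete]
result = run_pipeline(pipeline, [{"price": 200000.0, "sqft": 1000.0}])
print(result)
```

In this sketch the mined transform is simply placed at the front of the step list, so later steps see the derived column as if it had been part of the original data. A real system would also need to match column names across datasets and decide which mined expressions are worth keeping, which this example does not attempt.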