TY - JOUR
T1 - SubStrat
T2 - A Subset-Based Optimization Strategy for Faster AutoML
AU - Lazebnik, Teddy
AU - Somech, Amit
AU - Weinberg, Abraham Itzhak
N1 - Publisher Copyright:
© 2022, VLDB Endowment. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Automated machine learning (AutoML) frameworks have become important tools in the data scientist’s arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines (typically containing feature engineering, model selection, and hyperparameter tuning steps) and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, and therefore the overall AutoML running times become increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size rather than the configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset that preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally, it refines the resulting pipeline by executing a restricted, much shorter, AutoML process on the large dataset. Our experimental results, performed on three popular AutoML frameworks, Auto-Sklearn, TPOT, and H2O, show that SubStrat reduces their running times by 76.3% (on average), with only a 4.15% average decrease in the accuracy of the resulting ML pipeline.
AB - Automated machine learning (AutoML) frameworks have become important tools in the data scientist’s arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines (typically containing feature engineering, model selection, and hyperparameter tuning steps) and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, and therefore the overall AutoML running times become increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size rather than the configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset that preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally, it refines the resulting pipeline by executing a restricted, much shorter, AutoML process on the large dataset. Our experimental results, performed on three popular AutoML frameworks, Auto-Sklearn, TPOT, and H2O, show that SubStrat reduces their running times by 76.3% (on average), with only a 4.15% average decrease in the accuracy of the resulting ML pipeline.
UR - http://www.scopus.com/inward/record.url?scp=85146305554&partnerID=8YFLogxK
U2 - 10.14778/3574245.3574261
DO - 10.14778/3574245.3574261
M3 - Article
AN - SCOPUS:85146305554
SN - 2150-8097
VL - 16
SP - 772
EP - 780
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 4
ER -