TY - GEN
T1 - Demonstrating SubStrat
T2 - 31st ACM International Conference on Information and Knowledge Management, CIKM 2022
AU - Lazebnik, Teddy
AU - Somech, Amit
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/10/17
Y1 - 2022/10/17
AB - Automated machine learning (AutoML) frameworks are gaining popularity among data scientists as they dramatically reduce the manual work devoted to the construction of ML pipelines while obtaining similar and sometimes even better results than manually-built models. Such frameworks intelligently search among millions of possible ML pipeline configurations to finally retrieve an optimal pipeline in terms of predictive accuracy. However, when the training dataset is large, the construction and evaluation of a single ML pipeline take longer, which makes the overall AutoML running times increasingly high. To this end, in this work we demonstrate SubStrat, an AutoML optimization strategy that tackles the dataset size rather than the configuration search space. SubStrat wraps existing AutoML tools, and instead of executing them directly on the large dataset, it uses a genetic-based algorithm to find a small yet representative data subset that preserves characteristics of the original one. SubStrat then employs the AutoML tool on the generated subset, resulting in an intermediate ML pipeline, which is later refined by executing a restricted, much shorter, AutoML process on the large dataset. We demonstrate SubStrat on AutoSklearn, TPOT, and H2O, three popular AutoML frameworks, using several real-life datasets.
KW - automated machine learning (AutoML)
KW - data reduction
UR - http://www.scopus.com/inward/record.url?scp=85140835116&partnerID=8YFLogxK
U2 - 10.1145/3511808.3557160
DO - 10.1145/3511808.3557160
M3 - Conference contribution
AN - SCOPUS:85140835116
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 4907
EP - 4911
BT - CIKM 2022 - Proceedings of the 31st ACM International Conference on Information and Knowledge Management
Y2 - 17 October 2022 through 21 October 2022
ER -