TY - GEN

T1 - Multi-GPU processing of unstructured data for machine learning

AU - Ratsaby, Joel

AU - Timashkov, Alexander

N1 - Publisher Copyright:
© 2024 Research Paper Proceedings of the ISC High Performance 2024. All rights reserved.

PY - 2024

Y1 - 2024

N2 - We introduce a method for processing unstructured data for machine learning based on an LZ-complexity string distance. Because computing the LZ-complexity is an inherently serial data compression process, we introduce a string distance computed by a parallel algorithm that uses multiple GPU devices to process unstructured data, which typically exists in large quantities. We use this algorithm to compute a distance-matrix representation of the unstructured data that standard learning algorithms can take as input. Our approach eliminates the need for human-based feature definition or extraction: except for some simple manual data reformatting, it operates on the original raw data and is fully automatic. The parallel computation of the distance matrix is efficient, obtaining a speed-up factor of 528 when computing the distances between all pairs of 16 strings of length 1M bytes. We show that for time-series classification, relative to the ubiquitous TF-IDF data representation, the distance-matrix representation yields higher learning accuracy for most of a broad set of learning algorithms. Thus, the parallel algorithm can help in learning efficiently and accurately from unstructured data.

KW - CUDA

KW - LZ-complexity

KW - multi-GPU

KW - string distance

UR - http://www.scopus.com/inward/record.url?scp=85195106833&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85195106833

T3 - Research Paper Proceedings of the ISC High Performance 2024

BT - Research Paper Proceedings of the ISC High Performance 2024

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 39th International Conference on High Performance Computing, ISC High Performance 2024

Y2 - 12 May 2024 through 16 May 2024

ER -