TY - JOUR
T1 - Extending limited datasets with GAN-like self-supervision for SMS spam detection
AU - Anidjar, Or Haim
AU - Marbel, Revital
AU - Dubin, Ran
AU - Dvir, Amit
AU - Hajaj, Chen
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/10
Y1 - 2024/10
N2 - Short Message Service (SMS) spamming is a harmful phishing attack on mobile phones: fraudsters try to misuse personal user information through deceptive text messages, sometimes including a fake URL that asks for personal information such as passwords and usernames. In machine learning, several approaches have tried to address this problem, but the lack of available data has commonly been the main obstacle to a good enough solution. Different approaches, such as Generative Adversarial Networks (GANs), have been suggested, yet GANs are hard to train when datasets are limited in sample size. Therefore, in this paper, we suggest a dataset extension technique for small datasets, based on an Out Of Distribution (OOD) metric. We present a GAN-like method that imitates the generator concept of GANs in order to extend limited datasets using the OOD concept. Using a sophisticated text generation method, we show how to apply it to datasets from the domain of fraud and spam detection in SMS messages, and achieve over 25% relative improvement compared to two other solutions. In addition, because of the class imbalance in typical spam datasets, our approach is examined on another dataset to verify that the false alarm rate is low enough.
AB - Short Message Service (SMS) spamming is a harmful phishing attack on mobile phones: fraudsters try to misuse personal user information through deceptive text messages, sometimes including a fake URL that asks for personal information such as passwords and usernames. In machine learning, several approaches have tried to address this problem, but the lack of available data has commonly been the main obstacle to a good enough solution. Different approaches, such as Generative Adversarial Networks (GANs), have been suggested, yet GANs are hard to train when datasets are limited in sample size. Therefore, in this paper, we suggest a dataset extension technique for small datasets, based on an Out Of Distribution (OOD) metric. We present a GAN-like method that imitates the generator concept of GANs in order to extend limited datasets using the OOD concept. Using a sophisticated text generation method, we show how to apply it to datasets from the domain of fraud and spam detection in SMS messages, and achieve over 25% relative improvement compared to two other solutions. In addition, because of the class imbalance in typical spam datasets, our approach is examined on another dataset to verify that the false alarm rate is low enough.
KW - Fraud detection
KW - GAN
KW - Out of Distribution
KW - SMS-spamming
KW - Textual anomaly detection
UR - http://www.scopus.com/inward/record.url?scp=85199100521&partnerID=8YFLogxK
U2 - 10.1016/j.cose.2024.103998
DO - 10.1016/j.cose.2024.103998
M3 - Article
AN - SCOPUS:85199100521
SN - 0167-4048
VL - 145
JO - Computers and Security
JF - Computers and Security
M1 - 103998
ER -