TY - JOUR
T1 - Detection of malicious code by applying machine learning classifiers on static features
T2 - A state-of-the-art survey
AU - Shabtai, Asaf
AU - Moskovitch, Robert
AU - Elovici, Yuval
AU - Glezer, Chanan
PY - 2009/2
Y1 - 2009/2
N2 - This research synthesizes a taxonomy for classifying methods that detect new malicious code using Machine Learning (ML) applied to static features extracted from executables. The taxonomy is then used to classify research on this topic and pinpoint critical open research issues in light of emerging threats. The article addresses various facets of the detection challenge, including file representation and feature selection methods, classification algorithms, weighting ensembles, the imbalance problem, active learning, and chronological evaluation. From the survey we conclude that a framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining a low false-positive rate (i.e., misclassifying benign files as malicious). The framework should include training multiple classifiers on various types of features (mainly OpCode n-grams, byte n-grams, and Portable Executable features), applying a weighting algorithm to the classification results of the individual classifiers, and an active learning mechanism to maintain high detection accuracy. The training of classifiers should also account for the imbalance problem by generating classifiers that perform accurately in a real-life situation where the percentage of malicious files among all files is estimated to be approximately 10%.
AB - This research synthesizes a taxonomy for classifying methods that detect new malicious code using Machine Learning (ML) applied to static features extracted from executables. The taxonomy is then used to classify research on this topic and pinpoint critical open research issues in light of emerging threats. The article addresses various facets of the detection challenge, including file representation and feature selection methods, classification algorithms, weighting ensembles, the imbalance problem, active learning, and chronological evaluation. From the survey we conclude that a framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining a low false-positive rate (i.e., misclassifying benign files as malicious). The framework should include training multiple classifiers on various types of features (mainly OpCode n-grams, byte n-grams, and Portable Executable features), applying a weighting algorithm to the classification results of the individual classifiers, and an active learning mechanism to maintain high detection accuracy. The training of classifiers should also account for the imbalance problem by generating classifiers that perform accurately in a real-life situation where the percentage of malicious files among all files is estimated to be approximately 10%.
UR - http://www.scopus.com/inward/record.url?scp=65749099969&partnerID=8YFLogxK
U2 - 10.1016/j.istr.2009.03.003
DO - 10.1016/j.istr.2009.03.003
M3 - Article
AN - SCOPUS:65749099969
SN - 1363-4127
VL - 14
SP - 16
EP - 29
JO - Information Security Technical Report
JF - Information Security Technical Report
IS - 1
ER -