A data augmentation-based system for future malware prediction
Abstract
Malware detection is important for computer security. However, existing signature-based malware detection systems remain limited because they are designed to recognize only already established patterns of malicious code. In this work, a malware prediction model was developed to discover future malware strains and thereby improve the capability of malware detection systems. The method adopts the "malware in vision context" paradigm. Specifically, it generates new data points from a malware data distribution using a generative adversarial network (GAN) parameterized with a fully-connected neural network (FCN) architecture. The developed model generates malware images from a 100-dimensional Gaussian noise distribution and learns to distinguish them from real malware images. The generated malware is similar to, but not the same as, the real malware, as it consists of modified features when compared with real malware. To establish the feasibility of the proposed method for malware research, an experiment was conducted using Malimg, an image-based malware dataset. Due to certain technical constraints discussed in the study, 52.83% of the Malimg dataset was used to generate new malware data, which yielded 224.98%, amounting to 98.66% of the original dataset. Metrics such as Mean Squared Error (MSE), Structural Similarity Index (SSIM) and a customized enhancer (ABV) were used to evaluate the generated images. The best scores obtained for MSE, SSIM and ABV are 0.02, 0.91 and 1.00 respectively, while the worst scores are 0.07, 0.02 and 0.68 respectively. The uniqueness of the generated malware was also established. These results illustrate a simple yet effective approach to malware prediction and data augmentation.
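The generation and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it shows only the forward pass of a fully-connected generator with untrained random weights, plus an MSE comparison against a stand-in "real" image. The 100-dimensional Gaussian noise input comes from the abstract; the image size, hidden width, activations, and all function names are assumptions made for illustration. The discriminator and the adversarial training loop are omitted.

```python
import numpy as np

NOISE_DIM = 100        # from the abstract: 100-dimensional Gaussian noise
IMG_SHAPE = (64, 64)   # assumption: the abstract does not state an image size
HIDDEN = 256           # assumption: illustrative hidden-layer width


def init_generator(rng, noise_dim=NOISE_DIM, hidden=HIDDEN, img_shape=IMG_SHAPE):
    """Create random (untrained) weights for a two-layer fully-connected generator."""
    out_dim = img_shape[0] * img_shape[1]
    return {
        "W1": rng.normal(0.0, 0.02, size=(noise_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.02, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def generate(params, z, img_shape=IMG_SHAPE):
    """Map a batch of noise vectors to grayscale images with pixels in [0, 1]."""
    h = np.maximum(0.0, z @ params["W1"] + params["b1"])          # ReLU hidden layer
    x = 1.0 / (1.0 + np.exp(-(h @ params["W2"] + params["b2"])))  # sigmoid output
    return x.reshape(-1, *img_shape)


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
params = init_generator(rng)
z = rng.standard_normal((4, NOISE_DIM))   # batch of 4 Gaussian noise vectors
fake = generate(params, z)                # generated images, shape (4, 64, 64)
real = rng.random((4, *IMG_SHAPE))        # stand-in for real malware images
print(fake.shape, mse(fake[0], real[0]))
```

In a trained GAN the generator weights would be learned jointly with a discriminator. A lower MSE between a generated image and a real one indicates closer pixel-level similarity.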