Creating a model to detect malware using supervised learning algorithms

Background

Product Development Grant

TOBORRM has received an industry grant to develop malware detection algorithms based on behaviours and file parameters. The software development team at TOBORRM wrote a file-download identifier that scoured the internet for downloadable content. The goal was to develop a data set that can be used to identify malware based on parameters such as:

- Where the file came from
- How big the file was
- What type of file it is

as well as many other characteristics (or features).

MalwareSamples Data

The programmers found millions of files and proceeded to classify the files manually using a 3rd-party system at virustotal.com. As noted in the previous assessment, for each file specimen collected:

- TOBORRM's data collector would send the file to virustotal.com
- Files were tagged as "Malicious" if a majority of virustotal.com virus scanners recognised the file as containing malware (see Figure 1)
- Files were tagged as "Clean" if ALL virustotal.com scanners identified the file as "Clean" (see Figure 1)

Figure 1 – VirusTotal.com comparison of confirmed infected vs confirmed clean

As such, the "Actually Malicious" field can be considered a generally accurate classification for each downloaded sample.

Initially, the security and software development teams believed they would be able to gain insight from various statistical analyses of the dataset. Their initial attempts to classify the data lacked sensitivity and produced many false positives; the results of TOBORRM's analysis have been included in the "Initial Statistical Analysis" column, and they are poor.

Data set columns

The data set created by TOBORRM's developers includes the following columns, with a description of each column's source:

- Download Source: A description of where the sample came from
- TLD: Top-level domain of the site where the sample came from
- Download Speed: Speed recorded when obtaining the sample
- Ping Time To Server: Ping time to the server recorded when accessing the sample
- File Size (Bytes): The size of the sample file
- How Many Times File Seen: How many other times this sample has been seen at other sites (and not downloaded)
- Executable Code Maybe Present in Headers: The 'CodeCheck' program has flagged the file as possibly containing executable code in its file headers
- Calls to Low-Level System Libraries: When the file was opened or run, how many times low-level Windows system libraries were accessed
- Evidence of Code Obfuscation: The 'CodeCheck' program indicates that the contents of the file may be obfuscated
- Threads Started: How many threads were started when this file was accessed or launched
- Mean Word Length of Extracted Strings: Mean length of text strings extracted from the file using the Unix 'strings' program
- Similarity Score: An unknown scoring system used by 'CodeCheck'; it appears to score how similar the file is to other files recognised by 'CodeCheck'
- Characters in URL: How long the URL is (after the .com/.net part), e.g. /index.html = 10 characters
- Actually Malicious: The correct classification for the file
- Previous System Performance: Performance of "FileSentry3000™ v1.0"

SCENARIO

The industry grant received by TOBORRM requires that they provide a clear case for whether machine learning algorithms could solve the problem of classifying malicious software. Your task is to build on your previous work, run the data through appropriate machine learning modelling approaches, and tune them to optimise their accuracy.
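Before starting the task below, the data dictionary above can be cross-checked against the workbook itself. The following is a minimal sketch only, assuming the readxl and dplyr packages are installed, that MLDATASET_PartiallyCleaned.xlsx (the file used in Part 1) is in the R working directory, and that the back-ticked column name matches the dictionary exactly.

library(readxl)
library(dplyr)

# Import the partially cleaned workbook and review column names, types and example values
MLDATASET <- read_excel("MLDATASET_PartiallyCleaned.xlsx")
glimpse(MLDATASET)

# Class balance of the target label, Actually Malicious
count(MLDATASET, `Actually Malicious`)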
TASK

You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM's initial attempt to classify the samples.

Part 1 – General data preparation and cleaning

1. Import MLDATASET_PartiallyCleaned.xlsx into RStudio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.
2. Write the appropriate code in RStudio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows (an illustrative sketch of steps 2 to 5 appears after the Part 2 code listing below):
   (i) For How.Many.Times.File.Seen, set all values = 65535 to NA;
   (ii) Convert Threads.Started to a factor whose categories are given by:
        1 = 1 thread started
        2 = 2 threads started
        3 = 3 threads started
        4 = 4 threads started
        5 = 5 or more threads started
        Hint: Replace all values greater than 5 with 5, then use the factor(.) function.
   (iii) Log-transform Characters.in.URL using the log(.) function, and remove the original Characters.in.URL column from the dataset (unless you have overwritten it with the log-transformed data);
   (iv) Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.
3. Briefly outline the preparation and cleaning process in your report and explain why you believe the above steps were necessary.
4. Write the appropriate code in RStudio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID.
5. Export both the training and test datasets as CSV files; these will need to be submitted along with your code.

Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you will use only 30% of the data to train your ML models, to save time.

Part 2 – Compare the performances of different machine learning algorithms

Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.

library(tidyverse)
set.seed(Enter your student ID)
models.list1
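As referenced in Part 1 above, the following is a minimal sketch of steps 2 to 5, not a prescribed solution. It assumes the readxl and tidyverse packages are installed, that the column names match the dotted forms used in the brief (make.names(.) is applied in case the imported names still contain spaces), and it uses a placeholder seed and illustrative output file names; replace the seed with your student ID.

library(readxl)
library(tidyverse)

MLDATASET <- read_excel("MLDATASET_PartiallyCleaned.xlsx")

# Make column names syntactic (dotted), in case they were imported with spaces
names(MLDATASET) <- make.names(names(MLDATASET))

MLDATASET.cleaned <- MLDATASET %>%
  mutate(
    # 65535 is treated as a missing-value sentinel
    How.Many.Times.File.Seen = na_if(How.Many.Times.File.Seen, 65535),
    # Cap thread counts at 5, then convert to a labelled factor
    Threads.Started = factor(pmin(Threads.Started, 5),
                             levels = 1:5,
                             labels = c("1 thread started", "2 threads started",
                                        "3 threads started", "4 threads started",
                                        "5 or more threads started")),
    # Log-transform URL length, overwriting the original column
    Characters.in.URL = log(Characters.in.URL)
  ) %>%
  na.omit()

# Partition into 30% training / 70% test
set.seed(12345678)                      # replace with your student ID
train.rows <- sample(nrow(MLDATASET.cleaned),
                     size = round(0.3 * nrow(MLDATASET.cleaned)))
MLDATASET.train <- MLDATASET.cleaned[train.rows, ]
MLDATASET.test  <- MLDATASET.cleaned[-train.rows, ]

# Export both partitions for submission
write.csv(MLDATASET.train, "MLDATASET_train.csv", row.names = FALSE)
write.csv(MLDATASET.test,  "MLDATASET_test.csv",  row.names = FALSE)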

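The Part 2 code listing above is incomplete in this copy of the brief, so the contents of models.list1 are not reproduced here. Purely as an illustration of the comparison asked for in the TASK, and not the supplied code, one candidate algorithm could be fitted on the training partition and compared on the test partition against the Actually Malicious labels and against TOBORRM's initial attempt. The sketch below assumes the randomForest package, the CSV files exported in the Part 1 sketch, and that a dotted Initial.Statistical.Analysis column is present in the cleaned data alongside Previous.System.Performance; swap in whichever three algorithms you actually select.

# Illustration only: evaluating one candidate model against the baseline column
library(randomForest)

MLDATASET.train <- read.csv("MLDATASET_train.csv", stringsAsFactors = TRUE)
MLDATASET.test  <- read.csv("MLDATASET_test.csv",  stringsAsFactors = TRUE)

set.seed(12345678)                      # replace with your student ID

# Exclude the two existing classification-result columns from the predictors
rf.fit <- randomForest(
  Actually.Malicious ~ . - Initial.Statistical.Analysis - Previous.System.Performance,
  data = MLDATASET.train, ntree = 500)

# Confusion matrix of the fitted model on the test set
rf.pred <- predict(rf.fit, newdata = MLDATASET.test)
table(Predicted = rf.pred, Actual = MLDATASET.test$Actually.Malicious)

# Baseline: TOBORRM's initial attempt on the same test samples
table(Predicted = MLDATASET.test$Initial.Statistical.Analysis,
      Actual    = MLDATASET.test$Actually.Malicious)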