
Overview

During the last few weeks, you have learned how machine learning algorithms can be implemented using Python. scikit-learn is an open-source Python library that helps to implement supervised and unsupervised machine learning models. More information can be obtained from the website https://sklearn.org/. In this task, machine learning algorithms will be used for cyber-attack classification. The purpose of the task is to help students build knowledge and skills in using supervised machine learning for security analysis, gain hands-on implementation experience, and understand the overall goal of end-to-end project delivery in the area of cybersecurity analytics.

Do you know what an end-to-end data science project is? See the lifecycle of an end-to-end data science project in Figure 1. If you are building a data science application for security analysis, your problem will be related to cybersecurity and your data analysis needs to follow the steps below. See the task description for the detailed instructions.

Figure 1: Data Science Lifecycle [source: Sudeep, 2019 (accessed Jan 2020)]

In this Distinction/Higher Distinction Task, you will experiment with machine learning classification algorithms. Please see more details in the Task Description. Before attempting this task, please make sure you are already up to date with all previous Credit and Pass tasks.

Task Description

Instructions:

Suppose you are working in an organization as a security analyst. You need to conduct an end-to-end project on “cyber-attack classification in the network traffic database”. To complete the project, you follow the steps in Figure 1. Here, all of the steps have already been solved for you (by the teaching team, so you don’t need to take any action) except steps 6 and 7.
You need to complete these sections (highlighted in blue) by yourself to submit this task.

Step 1: Business Understanding (Problem Definitions)

Your aim is to develop a multi-class machine learning-based classification model to identify different network traffic classes for TWO BENCHMARK DATASETS.

Step 2: Data Gathering (Identify the source of data)

In industry, you need to communicate with your manager, client, other stakeholders and/or IT team to understand the source of the data and to gather it. Here, the teaching team has already gathered the data for you. You can access the datasets from the given GitHub link.

In this task, you need to perform experiments on TWO DATASETS.
1. The first dataset, “NSL-KDD”, can be obtained from the data folder; go to the “Week_5_NSL-KDD-Dataset” subfolder.
2. The second dataset is the “Processed Combined IoT dataset”.

If you are interested in learning more about the datasets, please visit the links below (not mandatory for the HD task).
Dataset 1 description: https://www.unb.ca/cic/datasets/nsl.html
Dataset 2 description: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9189760

***One starting example code for the 5-class classification (Dataset 1) is also given for your convenience, where some of the steps are already implemented. Please see the “SIT719_Prac05_Task02_HD_task_sample_done” notebook file (obtained from the GitHub link) for Dataset 1 (NSL-KDD).

***Another starting example code for the second dataset (TON IoT) is also given, where the data has been preprocessed for you. See the link “https://github.com/adnandeakin/2020-S2/blob/master/Copy_of_RF_on_IoT_Combined_Dataset.ipynb”

Step 3: Data Cleaning (Filtering anomalous data)

In a typical analysis, you may need to take care of missing values and inconsistent data. In week 2, you learnt how to deal with missing values and manipulate a database.
Here, this has already been taken care of for this dataset (so no action is needed for this task).

Step 4: Data Exploration (Understanding the data)

Some examples of data exploration are identifying the attribute names (header), checking the length of the train and test datasets, and checking the total number of samples that belong to each of the five classes of the training dataset. You don’t need to do anything here. However, these actions will help you to understand the data better in practice.

Step 5: Feature Engineering (Select Important Feature)

In a typical setup, you may need to do feature extraction or selection during your data analysis process. Here, the relevant feature engineering is already done for you in the sample code. So, no action is needed for this task.

Step 6: Predictive Modelling (Prediction of the classes): This is the task for you.

Dataset 1:
The DecisionTreeClassifier has been implemented for you. Now, you need to implement other techniques and compare them. Please do the following tasks:
1. Implement at least 5 benchmark classification algorithms.
2. Tune the parameters, if applicable, to obtain a good solution.
3. Obtain the confusion matrix for each of the scenarios (use the test dataset).
4. Calculate the performance measures for each of the classification algorithms, including Precision (%), Recall (%), F-Score (%) and False Alarm Rate, FPR (%).

You need to compare the results following the table below. Create one table for each algorithm (use the test dataset).

Attack Class | Precision (%) | Recall (%) | ……
DoS          |               |            |
Normal       |               |            |
Probe        |               |            |
R2L          |               |            |
U2R          |               |            |

Finally, summarize the results in a table similar to the one below (use the test dataset).

Algorithms | Accuracy (%) | Precision (%) | Recall (%) | ……
Alg 1      |              |               |            |
Alg 2      |              |               |            |
……         |              |               |            |

Dataset 2:
A sample Random Forest implementation is given to you. Repeat the procedure described for Dataset 1. The only difference is that you need to use a 70:30 train-test split (70% for training and 30% for testing), as there is no separate test set file.
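The measures requested in Step 6 can all be derived from a confusion matrix, and the same calculation applies to both datasets. The following is a minimal sketch, not the sample notebook's method: the function names `per_class_metrics` and `accuracy` are hypothetical, the counts in `toy_cm` are invented purely for illustration, and the matrix is assumed to have true classes as rows and predicted classes as columns (the layout that scikit-learn's `confusion_matrix` returns).

```python
def per_class_metrics(cm, labels):
    """Return {label: metrics in %} from a square confusion matrix.

    Assumption: cm[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in cm)
    results = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]                          # true k, predicted k
        fn = sum(cm[k]) - tp                   # true k, predicted something else
        fp = sum(row[k] for row in cm) - tp    # predicted k, true something else
        tn = total - tp - fn - fp              # everything not involving k

        # Guard every ratio against an empty denominator.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0   # false alarm rate

        results[label] = {
            "precision": 100 * precision,
            "recall": 100 * recall,
            "f_score": 100 * f_score,
            "fpr": 100 * fpr,
        }
    return results


def accuracy(cm):
    """Overall accuracy in %: correctly classified samples over all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return 100 * correct / total if total else 0.0


# Invented three-class example (NOT real results from either dataset).
toy_labels = ["DoS", "Normal", "Probe"]
toy_cm = [
    [8, 2, 0],   # true DoS
    [1, 9, 0],   # true Normal
    [0, 1, 9],   # true Probe
]

for name, m in per_class_metrics(toy_cm, toy_labels).items():
    print(f"{name:7s} P={m['precision']:.2f} R={m['recall']:.2f} "
          f"F={m['f_score']:.2f} FPR={m['fpr']:.2f}")
print(f"Accuracy={accuracy(toy_cm):.2f}")
```

In practice the matrix would come from the test-set predictions of each classifier, and each row of output would fill one row of the per-algorithm table above. Treating every class one-vs-rest in this way is only one common convention for multi-class FPR; check it against the definition used in the benchmark articles before comparing numbers.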
Please note, k-fold cross validation is also acceptable. However, as k-fold cross validation will take a huge amount of time, we have not made it mandatory.

Comparison of Results:

Your results need to be comparable against benchmark algorithms. For Dataset 1, for example, see the results reported in the article “An Adaptive Ensemble Machine Learning Model for Intrusion Detection”, published in IEEE Access, July 2019. For Dataset 2, please see the article “TON_IoT Telemetry Dataset: A New Generation Dataset of IoT and IIoT for Data-Driven Intrusion Detection Systems”. Your results will not be exactly the same, and that is nothing to worry about. Your target is to select the best-performing algorithms you can and achieve comparable results.

Step 7: Data Visualization

Perform the following tasks for both datasets:
1. Visualize and compare the accuracy of the different algorithms.
2. Plot the confusion matrix for each scenario.

Step 8: Results delivery:

Once you have completed the data analysis task for your security project, you need to deliver the outcome. In the real world, results are typically delivered as a product/tool/web app, through a presentation, or by submitting a report. However, in our unit we will consider a report-based submission only (PLEASE NOTE, the results obtained from the above steps need to be submitted in a REPORT format rather than just as a screenshot).

Here, you need to write a report (at least 3500 words) based on the outcome and results you obtained by performing the above steps. The report will describe the algorithms used, their working principles, key parameters, and the results. Results should consider all the key performance measures and comparative results in the form of tables, graphs, etc.

Submit the PDF report through onTrack.
You also need to submit (i) the code file and (ii) the Word/source file of the REPORT separately (within the “Code for task 5.1” folder) under the assignment tab of CloudDeakin.

Please note, it is a graded task where you will receive some feedback and marks. Your tutor/marker will assign marks based on the quality of your submission, the performance of your algorithms, the selection and novelty of your algorithms, your tuning and understanding of the algorithms, how well you have explained the results, your usage of scientific language, the authenticity of the claims, and finally the aesthetic look of your submission and the quality of your work in the tutor’s judgement. You will receive feedback based on the following marking rubric. The marker will judge how you have performed in the following categories.

Marking Rubric:

Criteria: Report Focus: Purpose/Position Statement (Total: /20)
Unsatisfactory - Beginning (0-7 points): Fails to clearly relate the report topic or is not clearly defined and/or the report lacks focus throughout.
Developing (8-11 points): The report is too broad in scope (outside of the title topic) and/or the report is somewhat unclear and needs to be developed further. Focal point is not consistently maintained throughout the report.
Accomplished (12-15 points): The report provides adequate direction with some degree of interest for the reader. The report states the position, and maintains the focal point of the analysis for the most part.
Exemplary (16-20 points): The report provides direction for the discussion part of the analysis that is engaging and thought provoking. The report clearly and concisely states the position, and consistently maintains the focal point.

Criteria: Comparative analysis and Discussion (Total: /30)
Unsatisfactory - Beginning (0-15 points): Demonstrates a lack of understanding and inadequate knowledge of the topic. Analysis is very superficial and contains flaws. The report is also not clear.
Developing (16-20 points): Demonstrates general understanding of Python scripting. Analysis is good and has addressed all criteria. Comparative analysis is presented. Sufficient discussion is also presented.
Accomplished (21-24 points): Demonstrates good level of understanding of Python scripting. Algorithms are fine-tuned and comprise a good selection of algorithms. Comparative results are presented using standard performance measures.
Exemplary (25-30 points): Demonstrates superior level of understanding of Python scripting and algorithms. Algorithms are fine-tuned with some novelty or hybridization or advanced and/or recent algorithm. Comparative results are presented using performance measures in a way that provides very clear and meaningful insights into the output.

Criteria: Organization (Total: /20)
Unsatisfactory - Beginning (0-6 points): Report lacks logical organization and impedes readers’ comprehension of ideas. Central position is rarely evident from paragraph to paragraph and/or the report is missing multiple required components.
Developing (7-11 points): Report is somewhat organized, although occasionally ideas from paragraph to paragraph may not flow well and/or connect to the central position or be clear as a whole. May be missing a required component and/or components may be less than complete. Discussion related to analysis result is presented but not very clear and insightful.
Accomplished (12-15 points): Report is adequately organized. Results are arranged reasonably well with a progression of thought from paragraph to paragraph connecting to the central position of the analysis. Includes required components, like visualization, table and graphs. The report is well organized and easy to follow.
Exemplary (16-20 points): Report is effectively organized. Ideas and results are arranged logically, flow smoothly, with a strong progression of thought from paragraph to paragraph connecting to the central position related to the analysis tasks. Includes all required components with supportive figures, tables, references, charts/graphs, equations, etc.

Criteria: Writing Quality & Adherence to Format Guidelines (Total: /30)
Unsatisfactory - Beginning (0-10 points): Report shows a below average/poor writing style lacking in elements of appropriate standard English. Frequent errors in spelling, grammar, punctuation, usage, and/or formatting.
Developing (11-17 points): Report shows an average and/or casual writing style using standard English. Some errors in spelling, grammar, punctuation, usage, and/or formatting.
Accomplished (18-21 points): Report shows above average writing style (can be considered good) and clarity in writing using standard English. Minor errors in grammar, punctuation, spelling, usage, and/or formatting. Author has demonstrated the use of scientific language and results are well explained.
Exemplary (22-30 points): Article is well written and clear, in standard English characterized by elements of a strong writing style. Basically free from grammar, punctuation, spelling, usage, or formatting errors. Author has demonstrated advanced use of scientific language and results are well explained with insights.

Rubric adopted from: Denise Kreiger, Instructional Design and Technology Services, SC&I, Rutgers University, 4/2014