Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections. The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.
A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
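Attacks fall into four main categories: DOS (denial of service, e.g. SYN flood); R2L (unauthorized access from a remote machine, e.g. password guessing); U2R (unauthorized access to local superuser (root) privileges, e.g. various buffer overflow attacks); and probing (surveillance and other probing, e.g. port scanning).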
Read the data into Pandas
import zipfile

import pandas as pd

pd.set_option('display.max_columns', 500)

# Read the CSV directly from the zip archive without extracting it to disk
with zipfile.ZipFile('../datasets/UNB_ISCX_NSL_KDD.csv.zip', 'r') as z:
    f = z.open('UNB_ISCX_NSL_KDD.csv')
    data = pd.read_csv(f, sep=',')

data.head()
Create X and y
Use only same_srv_rate and dst_host_srv_count
# Encode the target: 1 for anomaly, 0 for normal
y = (data['class'] == 'anomaly').astype(int)
y.value_counts()
0    77054
1    71463
Name: class, dtype: int64
X = data[['same_srv_rate','dst_host_srv_count']]
Increase sensitivity by lowering the threshold for predicting an anomalous connection
Create a new classifier by changing the probability threshold to 0.3
What is the new confusion matrix?
What is the new percentage of detected anomalies?
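One way to approach the threshold exercise is sketched below. It is not the notebook's own solution. It assumes a scikit-learn `LogisticRegression` as the classifier (the text does not name one) and uses synthetic two-column data as a stand-in for `X` and `y`, so the numbers it prints say nothing about the real dataset. The helper `predict_with_threshold` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split


def predict_with_threshold(model, X, threshold=0.5):
    """Label a row 1 (anomaly) when its predicted probability reaches the threshold."""
    proba_anomaly = model.predict_proba(X)[:, 1]
    return (proba_anomaly >= threshold).astype(int)


# Synthetic stand-in for the notebook's X and y (assumption: two numeric features).
rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    rng.normal(loc=1.0 - 0.5 * y_demo, scale=0.4),   # plays the role of same_srv_rate
    rng.normal(loc=150 - 80 * y_demo, scale=60.0),   # plays the role of dst_host_srv_count
])

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Assumed classifier: plain logistic regression.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare the default 0.5 cut-off with the lowered 0.3 cut-off.
results = {}
for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(model, X_test, threshold)
    results[threshold] = {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),  # share of anomalies detected
    }
    print(threshold, results[threshold]["confusion_matrix"].tolist(),
          round(results[threshold]["recall"], 3))
```

To apply this to the exercise, replace `X_demo` and `y_demo` with the `X` and `y` built above. Lowering the threshold can only add predicted anomalies, so the share of detected anomalies (recall) cannot decrease, while false positives typically rise.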