import pandas as pd
import matplotlib.pyplot as plt #visualisation
import seaborn as sns #visualisation
import numpy as np
Step One- Data Wrangling
1.1 Data Loading
Here I have loaded the dataset. To save myself from typing 'aps_failure.csv' every single time I have given the dataset a simplfied name 'afs'. Line 1 below tells the program where the data is while line 2 renames it for ease of use.
data = pd.read_csv('aps_failure_set.csv')
afs=pd.read_csv('aps_failure_set.csv')
Exploratory Analysis. I am gathering some very basic information on my datset so I know what I'm dealing with. I start this process with gathering basic information
afs.shape
(60000, 171)
The afs.shape above has told me I am dealing with a datset that has 171 columns and 60,000 rows. I will now use the afs.describe(include=object) function to provide me with some basic statistics on the data. This is useful for the following reasons:
-Count shows me that -Unique showes me that -Top shows me that -Freq shows me that
afs.head(5)
class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | neg | 76698 | na | 2130706438 | 280 | 0 | 0 | 0 | 0 | 0 | ... | 1240520 | 493384 | 721044 | 469792 | 339156 | 157956 | 73224 | 0 | 0 | 0 |
1 | neg | 33058 | na | 0 | na | 0 | 0 | 0 | 0 | 0 | ... | 421400 | 178064 | 293306 | 245416 | 133654 | 81140 | 97576 | 1500 | 0 | 0 |
2 | neg | 41040 | na | 228 | 100 | 0 | 0 | 0 | 0 | 0 | ... | 277378 | 159812 | 423992 | 409564 | 320746 | 158022 | 95128 | 514 | 0 | 0 |
3 | neg | 12 | 0 | 70 | 66 | 0 | 10 | 0 | 0 | 0 | ... | 240 | 46 | 58 | 44 | 10 | 0 | 0 | 0 | 4 | 32 |
4 | neg | 60874 | na | 1368 | 458 | 0 | 0 | 0 | 0 | 0 | ... | 622012 | 229790 | 405298 | 347188 | 286954 | 311560 | 433954 | 1218 | 0 | 0 |
5 rows × 171 columns
Now I begin to view the data. data.head(10) gives me the first 10 rows of the data.
This allows me to get an understanding of what I am actually dealing with. It is a good way
afs.tail(5)
class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
59995 | neg | 153002 | na | 664 | 186 | 0 | 0 | 0 | 0 | 0 | ... | 998500 | 566884 | 1290398 | 1218244 | 1019768 | 717762 | 898642 | 28588 | 0 | 0 |
59996 | neg | 2286 | na | 2130706538 | 224 | 0 | 0 | 0 | 0 | 0 | ... | 10578 | 6760 | 21126 | 68424 | 136 | 0 | 0 | 0 | 0 | 0 |
59997 | neg | 112 | 0 | 2130706432 | 18 | 0 | 0 | 0 | 0 | 0 | ... | 792 | 386 | 452 | 144 | 146 | 2622 | 0 | 0 | 0 | 0 |
59998 | neg | 80292 | na | 2130706432 | 494 | 0 | 0 | 0 | 0 | 0 | ... | 699352 | 222654 | 347378 | 225724 | 194440 | 165070 | 802280 | 388422 | 0 | 0 |
59999 | neg | 40222 | na | 698 | 628 | 0 | 0 | 0 | 0 | 0 | ... | 440066 | 183200 | 344546 | 254068 | 225148 | 158304 | 170384 | 158 | 0 | 0 |
5 rows × 171 columns
afs.shape
(60000, 171)
afs.info
<bound method DataFrame.info of class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \ 0 neg 76698 na 2130706438 280 0 0 0 0 1 neg 33058 na 0 na 0 0 0 0 2 neg 41040 na 228 100 0 0 0 0 3 neg 12 0 70 66 0 10 0 0 4 neg 60874 na 1368 458 0 0 0 0 ... ... ... ... ... ... ... ... ... ... 59995 neg 153002 na 664 186 0 0 0 0 59996 neg 2286 na 2130706538 224 0 0 0 0 59997 neg 112 0 2130706432 18 0 0 0 0 59998 neg 80292 na 2130706432 494 0 0 0 0 59999 neg 40222 na 698 628 0 0 0 0 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \ 0 0 ... 1240520 493384 721044 469792 339156 157956 73224 1 0 ... 421400 178064 293306 245416 133654 81140 97576 2 0 ... 277378 159812 423992 409564 320746 158022 95128 3 0 ... 240 46 58 44 10 0 0 4 0 ... 622012 229790 405298 347188 286954 311560 433954 ... ... ... ... ... ... ... ... ... ... 59995 0 ... 998500 566884 1290398 1218244 1019768 717762 898642 59996 0 ... 10578 6760 21126 68424 136 0 0 59997 0 ... 792 386 452 144 146 2622 0 59998 0 ... 699352 222654 347378 225724 194440 165070 802280 59999 0 ... 440066 183200 344546 254068 225148 158304 170384 ee_009 ef_000 eg_000 0 0 0 0 1 1500 0 0 2 514 0 0 3 0 4 32 4 1218 0 0 ... ... ... ... 59995 28588 0 0 59996 0 0 0 59997 0 0 0 59998 388422 0 0 59999 158 0 0 [60000 rows x 171 columns]>
afs.columns
Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000', 'ag_000', 'ag_001', 'ag_002', ... 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008', 'ee_009', 'ef_000', 'eg_000'], dtype='object', length=171)
null_values = afs.isnull().sum()
Checking the data type
afs.describe
<bound method NDFrame.describe of class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \ 0 neg 76698 na 2130706438 280 0 0 0 0 1 neg 33058 na 0 na 0 0 0 0 2 neg 41040 na 228 100 0 0 0 0 3 neg 12 0 70 66 0 10 0 0 4 neg 60874 na 1368 458 0 0 0 0 ... ... ... ... ... ... ... ... ... ... 59995 neg 153002 na 664 186 0 0 0 0 59996 neg 2286 na 2130706538 224 0 0 0 0 59997 neg 112 0 2130706432 18 0 0 0 0 59998 neg 80292 na 2130706432 494 0 0 0 0 59999 neg 40222 na 698 628 0 0 0 0 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \ 0 0 ... 1240520 493384 721044 469792 339156 157956 73224 1 0 ... 421400 178064 293306 245416 133654 81140 97576 2 0 ... 277378 159812 423992 409564 320746 158022 95128 3 0 ... 240 46 58 44 10 0 0 4 0 ... 622012 229790 405298 347188 286954 311560 433954 ... ... ... ... ... ... ... ... ... ... 59995 0 ... 998500 566884 1290398 1218244 1019768 717762 898642 59996 0 ... 10578 6760 21126 68424 136 0 0 59997 0 ... 792 386 452 144 146 2622 0 59998 0 ... 699352 222654 347378 225724 194440 165070 802280 59999 0 ... 440066 183200 344546 254068 225148 158304 170384 ee_009 ef_000 eg_000 0 0 0 0 1 1500 0 0 2 514 0 0 3 0 4 32 4 1218 0 0 ... ... ... ... 59995 28588 0 0 59996 0 0 0 59997 0 0 0 59998 388422 0 0 59999 158 0 0 [60000 rows x 171 columns]>
nan_afs=afs.isna()
print(nan_afs)
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \ 0 False False False False False False False False False 1 False False False False False False False False False 2 False False False False False False False False False 3 False False False False False False False False False 4 False False False False False False False False False ... ... ... ... ... ... ... ... ... ... 59995 False False False False False False False False False 59996 False False False False False False False False False 59997 False False False False False False False False False 59998 False False False False False False False False False 59999 False False False False False False False False False ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \ 0 False ... False False False False False False False 1 False ... False False False False False False False 2 False ... False False False False False False False 3 False ... False False False False False False False 4 False ... False False False False False False False ... ... ... ... ... ... ... ... ... ... 59995 False ... False False False False False False False 59996 False ... False False False False False False False 59997 False ... False False False False False False False 59998 False ... False False False False False False False 59999 False ... False False False False False False False ee_009 ef_000 eg_000 0 False False False 1 False False False 2 False False False 3 False False False 4 False False False ... ... ... ... 59995 False False False 59996 False False False 59997 False False False 59998 False False False 59999 False False False [60000 rows x 171 columns]
nan_rows = afs.isna().any(axis=1)
print(nan_rows)
0 False 1 False 2 False 3 False 4 False ... 59995 False 59996 False 59997 False 59998 False 59999 False Length: 60000, dtype: bool
clean_afs = afs.dropna()
print(clean_afs)
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \ 0 neg 76698 na 2130706438 280 0 0 0 0 1 neg 33058 na 0 na 0 0 0 0 2 neg 41040 na 228 100 0 0 0 0 3 neg 12 0 70 66 0 10 0 0 4 neg 60874 na 1368 458 0 0 0 0 ... ... ... ... ... ... ... ... ... ... 59995 neg 153002 na 664 186 0 0 0 0 59996 neg 2286 na 2130706538 224 0 0 0 0 59997 neg 112 0 2130706432 18 0 0 0 0 59998 neg 80292 na 2130706432 494 0 0 0 0 59999 neg 40222 na 698 628 0 0 0 0 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 \ 0 0 ... 1240520 493384 721044 469792 339156 157956 73224 1 0 ... 421400 178064 293306 245416 133654 81140 97576 2 0 ... 277378 159812 423992 409564 320746 158022 95128 3 0 ... 240 46 58 44 10 0 0 4 0 ... 622012 229790 405298 347188 286954 311560 433954 ... ... ... ... ... ... ... ... ... ... 59995 0 ... 998500 566884 1290398 1218244 1019768 717762 898642 59996 0 ... 10578 6760 21126 68424 136 0 0 59997 0 ... 792 386 452 144 146 2622 0 59998 0 ... 699352 222654 347378 225724 194440 165070 802280 59999 0 ... 440066 183200 344546 254068 225148 158304 170384 ee_009 ef_000 eg_000 0 0 0 0 1 1500 0 0 2 514 0 0 3 0 4 32 4 1218 0 0 ... ... ... ... 59995 28588 0 0 59996 0 0 0 59997 0 0 0 59998 388422 0 0 59999 158 0 0 [60000 rows x 171 columns]
I have noticed some data points say Na. I have asked how many are like this using code I found at this source: https://saturncloud.io/blog/how-to-find-all-rows-with-nan-values-in-python-pandas/#:~:text=To%20find%20all%20rows%20with%20NaN%20values%20in%20a%20Pandas,where%20NaN%20values%20are%20present.
Adding new column names as the current naming of the sensor columns is confusing. Source:
print(afs.columns)
Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000', 'ag_000', 'ag_001', 'ag_002', ... 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008', 'ee_009', 'ef_000', 'eg_000'], dtype='object', length=171)
afs.dtypes
class object aa_000 int64 ab_000 object ac_000 object ad_000 object ... ee_007 object ee_008 object ee_009 object ef_000 object eg_000 object Length: 171, dtype: object
def missing_values_table(afs):
mis_val = afs.apply(lambda x: x[x == 'na'].count(), axis=0)
mis_val_percent = 100 * mis_val / len(afs)
mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(
columns = {0 : 'Missing Values', 1 : '% of Total Values'}
)
mis_val_table_ren_columns = mis_val_table_ren_columns[
mis_val_table_ren_columns.iloc[:,1] != 0
].sort_values('% of Total Values', ascending=False).round(1)
print("Your selected dataframe has " + str(afs.shape[1]) + " columns.\n"
"There are " + str(mis_val_table_ren_columns.shape[0]) +
" columns that have missing values.")
return mis_val_table_ren_columns
missing_values_table(afs)
Your selected dataframe has 171 columns. There are 169 columns that have missing values.
Missing Values | % of Total Values | |
---|---|---|
br_000 | 49264 | 82.1 |
bq_000 | 48722 | 81.2 |
bp_000 | 47740 | 79.6 |
bo_000 | 46333 | 77.2 |
ab_000 | 46329 | 77.2 |
cr_000 | 46329 | 77.2 |
bn_000 | 44009 | 73.3 |
bm_000 | 39549 | 65.9 |
bl_000 | 27277 | 45.5 |
bk_000 | 23034 | 38.4 |
ch_000 | 14861 | 24.8 |
co_000 | 14861 | 24.8 |
cf_000 | 14861 | 24.8 |
cg_000 | 14861 | 24.8 |
ad_000 | 14861 | 24.8 |
db_000 | 13808 | 23.0 |
ct_000 | 13808 | 23.0 |
cu_000 | 13808 | 23.0 |
cv_000 | 13808 | 23.0 |
cx_000 | 13808 | 23.0 |
cy_000 | 13808 | 23.0 |
cz_000 | 13808 | 23.0 |
da_000 | 13808 | 23.0 |
dc_000 | 13808 | 23.0 |
ec_00 | 10239 | 17.1 |
cm_000 | 9877 | 16.5 |
ed_000 | 9553 | 15.9 |
cl_000 | 9553 | 15.9 |
ak_000 | 4400 | 7.3 |
ca_000 | 4356 | 7.3 |
dm_000 | 4009 | 6.7 |
dh_000 | 4008 | 6.7 |
dg_000 | 4008 | 6.7 |
df_000 | 4008 | 6.7 |
dl_000 | 4008 | 6.7 |
dj_000 | 4007 | 6.7 |
dk_000 | 4007 | 6.7 |
eb_000 | 4007 | 6.7 |
di_000 | 4006 | 6.7 |
ac_000 | 3335 | 5.6 |
bx_000 | 3257 | 5.4 |
cc_000 | 3255 | 5.4 |
ds_000 | 2727 | 4.5 |
bd_000 | 2727 | 4.5 |
dt_000 | 2727 | 4.5 |
du_000 | 2726 | 4.5 |
dv_000 | 2726 | 4.5 |
dp_000 | 2726 | 4.5 |
dq_000 | 2726 | 4.5 |
dr_000 | 2726 | 4.5 |
bc_000 | 2725 | 4.5 |
dy_000 | 2724 | 4.5 |
do_000 | 2724 | 4.5 |
de_000 | 2724 | 4.5 |
ef_000 | 2724 | 4.5 |
cp_000 | 2724 | 4.5 |
dx_000 | 2723 | 4.5 |
dz_000 | 2723 | 4.5 |
ea_000 | 2723 | 4.5 |
bz_000 | 2723 | 4.5 |
eg_000 | 2723 | 4.5 |
ar_000 | 2723 | 4.5 |
be_000 | 2503 | 4.2 |
dd_000 | 2503 | 4.2 |
ce_000 | 2502 | 4.2 |
ax_000 | 2501 | 4.2 |
ae_000 | 2500 | 4.2 |
af_000 | 2500 | 4.2 |
bf_000 | 2500 | 4.2 |
av_000 | 2500 | 4.2 |
bs_000 | 726 | 1.2 |
cb_000 | 726 | 1.2 |
dn_000 | 691 | 1.2 |
cq_000 | 691 | 1.2 |
bv_000 | 691 | 1.2 |
bu_000 | 691 | 1.2 |
ba_009 | 688 | 1.1 |
ba_003 | 688 | 1.1 |
ba_007 | 688 | 1.1 |
ba_000 | 688 | 1.1 |
ba_004 | 688 | 1.1 |
ba_006 | 688 | 1.1 |
ba_005 | 688 | 1.1 |
ba_002 | 688 | 1.1 |
ba_001 | 688 | 1.1 |
ba_008 | 688 | 1.1 |
cn_007 | 687 | 1.1 |
cn_006 | 687 | 1.1 |
cn_005 | 687 | 1.1 |
cn_008 | 687 | 1.1 |
cn_004 | 687 | 1.1 |
cn_003 | 687 | 1.1 |
cn_001 | 687 | 1.1 |
cn_000 | 687 | 1.1 |
cn_002 | 687 | 1.1 |
cn_009 | 687 | 1.1 |
cd_000 | 676 | 1.1 |
ag_004 | 671 | 1.1 |
ag_005 | 671 | 1.1 |
ag_006 | 671 | 1.1 |
ag_003 | 671 | 1.1 |
ag_008 | 671 | 1.1 |
ag_009 | 671 | 1.1 |
ay_000 | 671 | 1.1 |
ay_001 | 671 | 1.1 |
ay_002 | 671 | 1.1 |
ay_003 | 671 | 1.1 |
ag_007 | 671 | 1.1 |
az_009 | 671 | 1.1 |
ag_002 | 671 | 1.1 |
ag_001 | 671 | 1.1 |
ag_000 | 671 | 1.1 |
ee_000 | 671 | 1.1 |
ee_001 | 671 | 1.1 |
ee_002 | 671 | 1.1 |
ee_003 | 671 | 1.1 |
ee_004 | 671 | 1.1 |
ee_005 | 671 | 1.1 |
ee_006 | 671 | 1.1 |
ee_007 | 671 | 1.1 |
ee_008 | 671 | 1.1 |
ee_009 | 671 | 1.1 |
ay_004 | 671 | 1.1 |
ay_005 | 671 | 1.1 |
ay_006 | 671 | 1.1 |
az_003 | 671 | 1.1 |
ay_007 | 671 | 1.1 |
ay_008 | 671 | 1.1 |
ay_009 | 671 | 1.1 |
az_000 | 671 | 1.1 |
az_001 | 671 | 1.1 |
az_002 | 671 | 1.1 |
az_004 | 671 | 1.1 |
az_005 | 671 | 1.1 |
az_006 | 671 | 1.1 |
az_007 | 671 | 1.1 |
az_008 | 671 | 1.1 |
cs_009 | 669 | 1.1 |
cs_000 | 669 | 1.1 |
cs_001 | 669 | 1.1 |
cs_003 | 669 | 1.1 |
cs_004 | 669 | 1.1 |
cs_002 | 669 | 1.1 |
cs_008 | 669 | 1.1 |
cs_007 | 669 | 1.1 |
cs_006 | 669 | 1.1 |
cs_005 | 669 | 1.1 |
ah_000 | 645 | 1.1 |
bb_000 | 645 | 1.1 |
ap_000 | 642 | 1.1 |
bh_000 | 642 | 1.1 |
al_000 | 642 | 1.1 |
bg_000 | 642 | 1.1 |
an_000 | 642 | 1.1 |
at_000 | 629 | 1.0 |
as_000 | 629 | 1.0 |
aj_000 | 629 | 1.0 |
ai_000 | 629 | 1.0 |
am_0 | 629 | 1.0 |
au_000 | 629 | 1.0 |
bj_000 | 589 | 1.0 |
aq_000 | 589 | 1.0 |
ao_000 | 589 | 1.0 |
bi_000 | 589 | 1.0 |
by_000 | 473 | 0.8 |
ci_000 | 338 | 0.6 |
cj_000 | 338 | 0.6 |
ck_000 | 338 | 0.6 |
bt_000 | 167 | 0.3 |
Add in source for the above:
I have decided that anything with less then 60% readable data is not useable so I will cut anything that is under 60% usuable data. This includes:
br_000 bq_000 bp_000 bo_000 ab_000 cr_000 bn_000 bm_000 bl_000 bk_000
afs.drop(columns=['br_000', 'bq_000', 'bp_000', 'bo_000', 'ab_000', 'cr_000', 'bn_000', 'bm_000', 'bl_000', 'bk_000'], inplace=True)
print(afs.columns)
Index(['class', 'aa_000', 'ac_000', 'ad_000', 'ae_000', 'af_000', 'ag_000', 'ag_001', 'ag_002', 'ag_003', ... 'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008', 'ee_009', 'ef_000', 'eg_000'], dtype='object', length=161)
pd.set_option('display.max_rows', None)
new_column_names = {
"aa_000": "sensor1_reading",
"ab_000": "sensor2_reading",
"ac_000": "sensor3_reading",
"ad_000": "sensor4_reading",
"ae_000": "sensor5_reading",
"af_000": "sensor6_reading",
"ag_000": "sensor7_reading",
"ag_001": "sensor8_reading",
"ag_002": "sensor9_reading",
"ee_002": "sensor10_reading",
"ee_003": "sensor11_reading",
"ee_004": "sensor12_reading",
"ee_005": "sensor13_reading",
"ee_006": "sensor14_reading",
"ee_007": "sensor15_reading",
"ee_008": "sensor16_reading",
"ee_009": "sensor17_reading",
"ef_000": "sensor18_reading",
"eg_000": "sensor19_reading"
}
I have notcied my first issue with the data. I have 59,000 data points for negative and only 1000 for posiitve. The code column contains two attributes, negative (neg) and positive (pos). I will change these to numercal values so I can count them. neg=0, pos=1... Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.htmlAs the 'neg' column is not applicable to this project, I will remove them from the data set before I explore any further. "The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS." This data is unrelated and therefore not useful for my project. Firstly, I will change the data
afs = afs.drop(afs[afs['class'] == 'neg'].index)
afs.describe(include=object)
class | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ag_003 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
unique | 1 | 11 | 340 | 265 | 16 | 16 | 84 | 349 | 492 | 694 | ... | 873 | 871 | 871 | 866 | 865 | 859 | 845 | 370 | 9 | 10 |
top | pos | na | na | na | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
freq | 1000 | 771 | 462 | 645 | 645 | 645 | 899 | 621 | 461 | 232 | ... | 44 | 46 | 46 | 47 | 48 | 52 | 61 | 482 | 614 | 612 |
4 rows × 170 columns
This confirms that all 'neg' values have been dropped. Source: - See method 2 'Using the drop function' https://saturncloud.io/blog/how-to-remove-rows-with-specific-values-in-pandas-dataframe/#:~:text=Another%20method%20to%20remove%20rows,value%20we%20want%20to%20remove
afs.describe(include=object)
print(afs)
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 \ 9 pos 153204 0 182 na 0 0 0 0 23 pos 453236 na 2926 na 0 0 0 0 60 pos 72504 na 1594 1052 0 0 0 244 115 pos 762958 na na na na na 776 281128 135 pos 695994 na na na na na 0 0 ... ... ... ... ... ... ... ... ... ... 59484 pos 895178 na na na na na 0 0 59601 pos 862134 na na na na na 0 38834 59692 pos 186856 na na na 0 0 0 0 59742 pos 605092 na na na na na 0 44320 59769 pos 331704 na 1484 1142 0 0 0 267100 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 \ 9 0 ... 129862 26872 34044 22472 34362 0 23 222 ... 7908038 3026002 5025350 2025766 1160638 533834 60 178226 ... 1432098 372252 527514 358274 332818 284178 115 2186308 ... na na na na na na 135 0 ... 1397742 495544 361646 28610 5130 212 ... ... ... ... ... ... ... ... ... 59484 0 ... 9116224 4276644 8701496 8082264 5827284 2057354 59601 1227952 ... 3456564 1793170 4159190 5847384 8364506 12875424 59692 4300 ... 2713108 800182 322322 71638 34662 7304 59742 1048970 ... 3940400 1865730 3698692 3271958 9831898 3755392 59769 1384372 ... 3738648 1425312 3381954 4346910 2166330 296580 ee_008 ee_009 ef_000 eg_000 9 0 0 0 0 23 493800 6914 0 0 60 3742 0 0 0 115 na na na na 135 0 0 na na ... ... ... ... ... 59484 1662302 10790 na na 59601 661442 2458 na na 59692 2538 0 0 0 59742 65610 0 na na 59769 15434 0 0 0 [1000 rows x 171 columns]
Above I have ran the neg_count function to ensure that the negitive values were dropped. I then ran the describe function to confirm that the value of "class" is now "1" instead of two. Source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html
I need to get rid of the n/a in the ab_00 column. Below I will experiment with different strategies to do this. Forst, I will explore the classification and regression of the dataset. For this project, I will use multiclass classification.
afs.shape
(1000, 171)
#Requesting basic info on the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 60000 entries, 0 to 59999 Columns: 171 entries, class to eg_000 dtypes: int64(1), object(170) memory usage: 78.3+ MB
Basic Statistical Information on the dataset
data.describe()
aa_000 | |
---|---|
count | 6.000000e+04 |
mean | 5.933650e+04 |
std | 1.454301e+05 |
min | 0.000000e+00 |
25% | 8.340000e+02 |
50% | 3.077600e+04 |
75% | 4.866800e+04 |
max | 2.746564e+06 |
I am checking the code for blank data. Note for myself- add in why this is important from lecture notes
afs.isnull().sum()
class 0 aa_000 0 ab_000 0 ac_000 0 ad_000 0 .. ee_007 0 ee_008 0 ee_009 0 ef_000 0 eg_000 0 Length: 171, dtype: int64
print(afs.isnull().values.any())
False
afs["class"].value_counts().sort_index()
pos 1000 Name: class, dtype: int64
I have notcied my first issue with the data. I have 59,000 data points for negative and only 1000 for posiitve. The code column contains two attributes, negative (neg) and positive (pos). I will change these to numercal values so I can count them. neg=0, pos=1... Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.html
# Displaying the first few rows of the 'class' column and its distribution
(data['class'].head(), data['class'].value_counts(normalize=True))
(0 neg 1 neg 2 neg 3 neg 4 neg Name: class, dtype: object, neg 0.983333 pos 0.016667 Name: class, dtype: float64)
This indicates that my dataset has no missing values or invalid data types. This is a good sign as my data is 'complete' and no further action is required.
Now I will begin to visualise my data using seaborne.