In [144]:

import pandas as pd
import matplotlib.pyplot as plt #visualisation
import seaborn as sns #visualisation
import numpy as np 

Here I load the dataset. To save myself from typing 'aps_failure_set.csv' every single time, I read it into a DataFrame with the simplified name 'afs'. Line 1 below reads the file into 'data', which I leave unmodified; line 2 reads the same file again into 'afs', which is the working copy I will change.

In [145]:

data = pd.read_csv('aps_failure_set.csv')
afs=pd.read_csv('aps_failure_set.csv')

Exploratory Analysis. I am gathering some very basic information on my dataset so I know what I'm dealing with.

In [146]:

afs.shape

Out[146]:

(60000, 171)

The afs.shape output above tells me I am dealing with a dataset that has 60,000 rows and 171 columns. I will now use afs.describe(include=object) to get some basic statistics on the text (object) columns. This is useful for the following reasons:

-Count shows me the number of non-null entries in each column
-Unique shows me how many distinct values each column contains
-Top shows me the most frequent value in each column
-Freq shows me how many times that most frequent value occurs

In [147]:

afs.describe(include=object)

Out[147]:

	class	ab_000	ac_000	ad_000	ae_000	af_000	ag_000	ag_001	ag_002	ag_003	...	ee_002	ee_003	ee_004	ee_005	ee_006	ee_007	ee_008	ee_009	ef_000	eg_000
count	60000	60000	60000	60000	60000	60000	60000	60000	60000	60000	...	60000	60000	60000	60000	60000	60000	60000	60000	60000	60000
unique	2	30	2062	1887	334	419	155	618	2423	7880	...	34489	31712	35189	36289	31796	30470	24214	9725	29	50
top	neg	na	0	na	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
freq	59000	46329	8752	14861	55543	55476	59133	58587	56181	46894	...	1364	1557	1797	2814	4458	7898	17280	31863	57021	56794

4 rows × 170 columns
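
To make the four summary rows concrete, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its values are illustrative assumptions and are not taken from the APS data; `dtype=object` is set explicitly so the columns are treated the same way as the text columns above.

```python
import pandas as pd

# Hypothetical toy data: 'na' is stored as a plain string, as in the APS file
toy = pd.DataFrame(
    {
        "class": ["neg", "neg", "pos", "neg"],
        "ab_000": ["na", "0", "na", "2"],
    },
    dtype=object,
)

summary = toy.describe(include=object)
print(summary)
# count  -> non-null entries per column
# unique -> number of distinct values per column
# top    -> most frequent value per column
# freq   -> how often that most frequent value occurs
```

Note that 'na' can show up as the top value, exactly as it does for ab_000 above, because describe treats it as ordinary text.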

Now I begin to view the data. afs.head(5) gives me the first 5 rows of the data, and afs.tail(5) gives me the last 5.

This allows me to get an understanding of what I am actually dealing with.

In [148]:

# Display the first 5 records
afs.head(5)

Out[148]:

	class	aa_000	ab_000	ac_000	ad_000	af_000	...	ee_002	ee_003	ee_004	ee_005	ee_006	ee_007	ee_008	ee_009	ef_000	eg_000
0	neg	76698	na	2130706438	280	0	...	1240520	493384	721044	469792	339156	157956	73224	0	0	0
1	neg	33058	na	0	na	0	...	421400	178064	293306	245416	133654	81140	97576	1500	0	0
2	neg	41040	na	228	100	0	...	277378	159812	423992	409564	320746	158022	95128	514	0	0
3	neg	12	0	70	66	10	...	240	46	58	44	10	0	0	0	4	32
4	neg	60874	na	1368	458	0	...	622012	229790	405298	347188	286954	311560	433954	1218	0	0

5 rows × 171 columns

In [149]:

afs.tail(5)

Out[149]:

	class	aa_000	ab_000	ac_000	ad_000	...	ee_002	ee_003	ee_004	ee_005	ee_006	ee_007	ee_008	ee_009
59995	neg	153002	na	664	186	...	998500	566884	1290398	1218244	1019768	717762	898642	28588
59996	neg	2286	na	2130706538	224	...	10578	6760	21126	68424	136	0	0	0
59997	neg	112	0	2130706432	18	...	792	386	452	144	146	2622	0	0
59998	neg	80292	na	2130706432	494	...	699352	222654	347378	225724	194440	165070	802280	388422
59999	neg	40222	na	698	628	...	440066	183200	344546	254068	225148	158304	170384	158

5 rows × 171 columns

In [150]:

print(afs.columns)

Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',
       'ag_000', 'ag_001', 'ag_002',
       ...
       'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',
       'ee_009', 'ef_000', 'eg_000'],
      dtype='object', length=171)

Checking the data type

In [151]:

afs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 171 entries, class to eg_000
dtypes: int64(1), object(170)
memory usage: 78.3+ MB

As the 'neg' class is not applicable to this project, I will remove those rows from the dataset before I explore any further. "The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS." The negative rows are unrelated to APS failures and therefore not useful for my project.

Code source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html

In [164]:

afs = afs.drop(afs[afs['class'] == 'neg'].index)

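In [165]:

# Count any rows still labelled 'neg' (should be 0 after the drop)
neg_count = (afs['class'] == 'neg').sum()

In [168]:

print("Negative count:", neg_count)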

Negative count: 0

This confirms that all 'neg' rows have been dropped. Source: see method 2, 'Using the drop function', at https://saturncloud.io/blog/how-to-remove-rows-with-specific-values-in-pandas-dataframe/#:~:text=Another%20method%20to%20remove%20rows,value%20we%20want%20to%20remove
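
As a cross-check on the drop-based approach, the same filtering can be written as a boolean mask. This is only a sketch on a small hypothetical frame (`toy` and its values are made up for illustration); it is an alternative to the drop call, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for afs, with a mix of 'neg' and 'pos' rows
toy = pd.DataFrame(
    {
        "class": ["neg", "pos", "neg", "pos"],
        "aa_000": [76698, 153204, 33058, 453236],
    }
)

# Keep every row whose class is not 'neg'
# (same result as toy.drop(toy[toy["class"] == "neg"].index))
pos_only = toy[toy["class"] != "neg"]

# Check: no 'neg' rows remain, and the original row labels are preserved
neg_count = (pos_only["class"] == "neg").sum()
print("Negative count:", neg_count)
print(pos_only.index.tolist())
```

Both forms keep the original row labels, which is why the filtered afs printed in this notebook starts at index 9 rather than 0.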

In [156]:

afs.describe(include=object)
print(afs)

      class  aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000  ag_001  \
9       pos  153204      0    182     na      0      0      0       0   
23      pos  453236     na   2926     na      0      0      0       0   
60      pos   72504     na   1594   1052      0      0      0     244   
115     pos  762958     na     na     na     na     na    776  281128   
135     pos  695994     na     na     na     na     na      0       0   
...     ...     ...    ...    ...    ...    ...    ...    ...     ...   
59484   pos  895178     na     na     na     na     na      0       0   
59601   pos  862134     na     na     na     na     na      0   38834   
59692   pos  186856     na     na     na      0      0      0       0   
59742   pos  605092     na     na     na     na     na      0   44320   
59769   pos  331704     na   1484   1142      0      0      0  267100   

        ag_002  ...   ee_002   ee_003   ee_004   ee_005   ee_006    ee_007  \
9            0  ...   129862    26872    34044    22472    34362         0   
23         222  ...  7908038  3026002  5025350  2025766  1160638    533834   
60      178226  ...  1432098   372252   527514   358274   332818    284178   
115    2186308  ...       na       na       na       na       na        na   
135          0  ...  1397742   495544   361646    28610     5130       212   
...        ...  ...      ...      ...      ...      ...      ...       ...   
59484        0  ...  9116224  4276644  8701496  8082264  5827284   2057354   
59601  1227952  ...  3456564  1793170  4159190  5847384  8364506  12875424   
59692     4300  ...  2713108   800182   322322    71638    34662      7304   
59742  1048970  ...  3940400  1865730  3698692  3271958  9831898   3755392   
59769  1384372  ...  3738648  1425312  3381954  4346910  2166330    296580   

        ee_008 ee_009 ef_000 eg_000  
9            0      0      0      0  
23      493800   6914      0      0  
60        3742      0      0      0  
115         na     na     na     na  
135          0      0     na     na  
...        ...    ...    ...    ...  
59484  1662302  10790     na     na  
59601   661442   2458     na     na  
59692     2538      0      0      0  
59742    65610      0     na     na  
59769    15434      0      0      0  

[1000 rows x 171 columns]

Above I ran the neg_count check to ensure that the negative rows were dropped. The printed frame now has 1,000 rows, all with class 'pos', so "class" has one unique value instead of two. Note that the describe call in the same cell produces no output, because a notebook cell only displays the result of its last expression. Source: https://www.w3docs.com/snippets/python/deleting-dataframe-row-in-pandas-based-on-column-value.html

I need to get rid of the 'na' entries in the ab_000 column. Below I will experiment with different strategies to do this. First, I will explore the classification and regression of the dataset. For this project, I will use multiclass classification.
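
Two common strategies for the 'na' entries can be sketched as follows. This is a minimal illustration on a made-up frame, not a recommendation for this dataset: the `toy` values, the 0.5 cut-off, and the choice of the median are all assumptions. It also relies on `errors="coerce"`, which turns any non-numeric text (not only 'na') into NaN.

```python
import pandas as pd

# Hypothetical sample: sensor columns hold numbers as text, with 'na' for missing
toy = pd.DataFrame(
    {
        "class": ["pos", "pos", "pos", "pos"],
        "ab_000": ["na", "na", "0", "na"],
        "ac_000": ["182", "2926", "1594", "na"],
    }
)

features = toy.drop(columns="class")

# Step 1: convert text to numbers; anything non-numeric (here 'na') becomes NaN
numeric = features.apply(pd.to_numeric, errors="coerce")

# Step 2: share of missing values in each column
missing_share = numeric.isna().mean()

# Strategy A: drop columns that are mostly missing (0.5 is an arbitrary cut-off)
kept = numeric.loc[:, missing_share <= 0.5]

# Strategy B: fill the remaining gaps with each column's median
imputed = kept.fillna(kept.median())

print(missing_share)
print(imputed)
```

In this toy case ab_000 is dropped because three of its four values are missing, and the single gap in ac_000 is filled with that column's median.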

In [157]:

afs.shape

Out[157]:

(1000, 171)

In [158]:

#Requesting basic info on the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 171 entries, class to eg_000
dtypes: int64(1), object(170)
memory usage: 78.3+ MB

Basic Statistical Information on the dataset

In [159]:

data.describe()

Out[159]:

	aa_000
count	6.000000e+04
mean	5.933650e+04
std	1.454301e+05
min	0.000000e+00
25%	8.340000e+02
50%	3.077600e+04
75%	4.866800e+04
max	2.746564e+06

I am checking the data for blank (null) values. Note for myself: add in why this is important from lecture notes.

In [160]:

afs.isnull().sum()

Out[160]:

class     0
aa_000    0
ab_000    0
ac_000    0
ad_000    0
         ..
ee_007    0
ee_008    0
ee_009    0
ef_000    0
eg_000    0
Length: 171, dtype: int64

In [161]:

print(afs.isnull().values.any())

False

In [162]:

afs["class"].value_counts().sort_index()

Out[162]:

pos    1000
Name: class, dtype: int64

I have noticed my first issue with the data: the classes are heavily imbalanced. The full dataset has 59,000 data points for negative and only 1,000 for positive. The class column contains two values, negative (neg) and positive (pos). I will change these to numerical values so I can count them: neg=0, pos=1. Source: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline.html
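
One simple way to do the neg=0, pos=1 encoding is a plain pandas mapping, sketched below. This uses `Series.map` rather than the scikit-learn encoders described in the cited source, and the `labels` Series is a hypothetical stand-in for the class column of a frame that still contains both classes (such as `data`).

```python
import pandas as pd

# Hypothetical labels in the same 'neg'/'pos' form as the class column
labels = pd.Series(["neg", "pos", "neg", "neg"], name="class")

# An explicit mapping keeps the encoding readable: neg=0, pos=1
encoded = labels.map({"neg": 0, "pos": 1})

# Count how many rows fall into each encoded class
print(encoded.value_counts().sort_index())
```

Any label that is missing from the mapping would come out as NaN, which makes unexpected values easy to spot.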

In [163]:

# Displaying the first few rows of the 'class' column and its distribution
(data['class'].head(), data['class'].value_counts(normalize=True))

Out[163]:

(0    neg
 1    neg
 2    neg
 3    neg
 4    neg
 Name: class, dtype: object,
 neg    0.983333
 pos    0.016667
 Name: class, dtype: float64)

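The isnull() checks above report no missing values, but that result is misleading. The file stores missing entries as the text 'na' rather than as true nulls, so isnull() cannot see them; in the full dataset, ab_000 alone contains 46,329 'na' entries according to the describe output. This is also why 170 of the 171 columns have the object data type. My data is therefore not yet 'complete', and the 'na' entries need to be handled before I go further.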

Now I will begin to visualise my data using seaborn.

In [ ]: