#!/usr/bin/env python
# coding: utf-8

# 1. Library Imports
#
# In this code snippet we import the libraries that assist with data analysis and visualization: pandas for data manipulation and analysis, matplotlib.pyplot and seaborn for creating visualizations, numpy for numerical operations and array handling, and warnings to suppress unnecessary warnings during the analysis.

# In[87]:

import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns
from io import StringIO
import numpy as np
import warnings
from mpl_toolkits.mplot3d import Axes3D
warnings.filterwarnings('ignore')

afs = pd.read_csv('aps_failure_set.csv')

# Beginning my Exploratory Data Analysis by inspecting the header with head(20), which returns the first 20 rows of data.

# In[88]:

afs.head(20)

# Inspecting the last five rows gives me a brief overview of the dataset. With 171 features, my first impression was that there is a lot of noise in this dataset, so the pre-processing and preparation phase of this project will be key to my success. I have also noticed that my data contains NA values, which I will need to remove or change as part of my preprocessing.
#
# To gain an understanding of the data, I display the first 20 rows of the dataframe with afs.head(20), the last 5 rows with afs.tail(), and an overview of the dataset with afs.info().

# In[89]:

afs.tail()

# In[90]:

afs.info()

# The assignment stated that:
#
# "The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The data consists of a subset of all available data, selected by experts. This analysis will help determine the investment strategy for the company in the upcoming year."
#
# Therefore, my first act is to delete the unnecessary 'negative class' rows contained in the dataset, as they are not relevant.
#
# df_filtered = afs[afs['class'] != 'neg']
#
# df_filtered.to_csv('afs-filtered.csv', index=False)
#
# df_filtered.head(100)
#
# This code filters out any rows where the value in the 'class' column is 'neg'. The resulting filtered data is then saved to a CSV file named 'afs-filtered.csv'.

# In[91]:

df_filtered = afs[afs['class'] != 'neg']
df_filtered.to_csv('afs-filtered.csv', index=False)

# Reference: https://sparkbyexamples.com/pandas/pandas-filter-rows-by-conditions/

# In[92]:

df_filtered.head(100)

# I continue my exploratory analysis of the dataset using the .info() function. I notice that I have two feature data types: float64 (12 columns) and object (159 columns). In pandas dtypes, object represents either text or a mixture of numeric and non-numeric values. I will need to fix this.

# In[93]:

df_filtered.info()

# In[94]:

df_filtered.shape

# I gather some more information about my dataset to see if I can spot any more issues.
#
# df_filtered.describe() gathers basic statistical information about the dataset. By default this provides statistics for numerical columns only, including the mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile and maximum value. The result here indicates that only aa_000 is recognised as a numerical column, so a further investigation into the dtypes is needed.
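# A quick way to quantify the object/float split before digging further is to tally the dtypes directly; a minimal sketch (illustrative only, using the df_filtered frame created above):

# In[ ]:

# Count how many columns pandas currently stores under each dtype.
print(df_filtered.dtypes.value_counts())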
# In[95]:

df_filtered.describe()

# In[ ]:

# I looked further into the data types to see more clearly which features were 'objects' and which were 'float64'.
#
# df_filtered.dtypes gives me the data type of each feature.

# Below I begin experimenting with how to change the dtypes so that they are all numeric. This process took many iterations and was the first difficulty I faced in the data pre-processing phase. Firstly, I attempted to change the data type of one column that was marked as an 'object'. I chose 'ac_000' and rechecked the data type after running df_filtered['ac_000'] = df_filtered['ac_000'].astype('float64'). Checking df_filtered.dtypes again shows that 'ac_000' is now classed as float64, confirming that my experiment worked.

# In[96]:

df_filtered.dtypes

# Reference: https://pbpython.com/pandas_dtypes.html

# In[97]:

print(df_filtered.columns)

# Below, I begin to remove non-numeric datapoints.
#
# import numpy as np - imports the numpy library (already done above).
# missing_value_formats = ["n.a.","?","NA","n/a", "na"] - lists the missing-value formats that I want to be replaced.
# df_filtered.replace(missing_value_formats, np.nan, inplace=True) - uses the replace function to replace any of these formats with np.nan. Pandas recognises NaN as meaning 'Not a Number' and treats it as a missing value.
# print(df_filtered.head(150)) - requests the first 150 rows to check that the method worked.

# In[98]:

missing_value_formats = ["n.a.", "?", "NA", "n/a", "na"]
df_filtered.replace(missing_value_formats, np.nan, inplace=True)

# Now that this issue is resolved, I can change the objects to float64.

# Reference: https://stackoverflow.com/questions/66043989/how-to-replace-na-values-with-np-nan-file-imported-using-pandas-read-pickle

# In[99]:

object_columns = df_filtered.select_dtypes(include='object').columns
object_columns = object_columns[object_columns != 'class']
df_filtered[object_columns] = df_filtered[object_columns].astype('float64')

# References: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
# https://stackoverflow.com/questions/21720022/find-all-columns-of-dataframe-in-pandas-whose-type-is-float-or-a-particular-typ
# https://datagy.io/pandas-convert-object-to-float/#:~:text=One%20of%20the%20most%20common,that%20you%20want%20to%20use.

# In[100]:

df_filtered.dtypes

# Here, I check that this method has worked correctly. I can confirm that df_filtered.describe() now recognises all the numeric features.

# In[101]:

df_filtered.describe()

# Next, I remove features in which more than 50% of the values are missing. This means a feature is only kept if at least half of its values are available. Using the dropna function with axis=1 and a thresh value drops columns with fewer non-missing values than the threshold. This helps reduce my data by getting rid of features that are mostly missing.

# In[102]:

threshold = 0.5 * len(df_filtered)
df_filtered = df_filtered.dropna(thresh=threshold, axis=1)

# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# https://www.w3schools.com/python/ref_func_len.asp#:~:text=Definition%20and%20Usage,of%20characters%20in%20the%20string.

# In[103]:

df_filtered.shape

# I check the shape of the data to see how many features have been removed. The count has gone down from 171 to 144, so 27 have been removed with this method. I run my describe function again to view the statistics for reference.
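# Before re-running describe, it is also worth checking how much missingness remains after the 50% threshold drop; a minimal sketch (illustrative only):

# In[ ]:

# Share of missing values per remaining column, highest first.
missing_share = df_filtered.isnull().mean().sort_values(ascending=False)
print(missing_share.head(20))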
# In[104]:

df_filtered.describe()

# Double-checking that the data types have been changed.

# In[105]:

print(df_filtered.dtypes)

# In[106]:

df_filtered.head(50)

# Below, I try to drop rows where every value is missing. I checked the shape before and after and noticed no change, but decided to keep it here for reference.

# In[107]:

df_filtered.shape

# In[108]:

df_filtered = df_filtered.dropna(how='all')

# In[109]:

df_filtered.shape

# I ran a correlation matrix as part of data preprocessing and noticed some blank rows. I double-checked with the describe function and removed these rows to further reduce the data.

# In[110]:

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
correlation_matrix = df_filtered.corr()
print(correlation_matrix)

# In[111]:

df_filtered.shape

# In[112]:

df_filtered.describe()
print(df_filtered)

# In[113]:

df_filtered.shape

# In[114]:

df_filtered.info()

# In[115]:

df_filtered.describe()

# In[116]:

df_filtered.isnull().values.any()

# In[117]:

df_filtered.isnull().sum()

# In[118]:

df_filtered.shape

# In[119]:

columns_list = df_filtered.columns.tolist()
print(columns_list)

# In[120]:

df_filtered.count()

# In[121]:

df_filtered.shape

# In[122]:

# Keep only rows that have at least one non-zero, non-missing value outside the 'class' column.
df_filtered = df_filtered[df_filtered.drop("class", axis=1).replace(0, np.nan).notna().any(axis=1)]

# In[123]:

df_filtered.shape

# In[124]:

df_filtered.count()

# In[125]:

columns_list = df_filtered.columns.tolist()
for col in columns_list:
    print(col)

# In[126]:

df_filtered.shape

# Correlation Matrix Heatmap - The Curse of Dimensionality

# In[127]:

plt.figure(figsize=(40, 20))
sns.heatmap(df_filtered.corr(), cmap="BrBG", annot=True)
plt.show()

# This correlation heatmap is a mess: there is far too much information, and it is a good example of the curse of dimensionality. We cannot see through the noise, so further dimensionality reduction is needed. I will compute the IQR and delete any features with an IQR of 0 or near 0.

# In[ ]:

# In[128]:

Q1 = df_filtered.quantile(0.25)
Q3 = df_filtered.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

# In[43]:

df_filtered = df_filtered.drop(columns=['au_000', 'ay_009', 'az_009', 'cd_000', 'cs_009', 'dz_000', 'eg_000'], errors='ignore')

# The columns above were dropped to reduce dimensionality in the dataset. All of them had an IQR of 0 and therefore did not add to my analysis, as explained above.

# In[44]:

df_filtered.shape

# In[135]:

to_drop = ['ae_000', 'af_000', 'ag_000', 'as_000', 'ay_000', 'ay_001', 'ay_002', 'ay_003', 'ay_004', 'az_009', 'ea_000', 'ef_000']
df_filtered = df_filtered.drop(columns=to_drop, errors='ignore')

# Further features have now been dropped. I used two different methods of dropping the columns for practice ;-) and can see below that these have also been removed.

# In[136]:

df_filtered.shape

# In[137]:

df_filtered.info()
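# As a side note, the near-zero-IQR screen above could also be done programmatically rather than by reading the printout; a minimal sketch (illustrative only, it does not drop anything itself):

# In[ ]:

# List numeric columns whose interquartile range is (near) zero.
numeric_part = df_filtered.select_dtypes(include='number')
iqr = numeric_part.quantile(0.75) - numeric_part.quantile(0.25)
print(iqr[iqr <= 0].index.tolist())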
# Graphs
#
# To visualise the dataframe I use a variety of charts below, helping me to display the data in different and interesting ways. See below for a range of boxplots:

# Boxplots

# In[138]:

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(y=df_filtered['ag_009'])
plt.title('Box plot of ag_009')
plt.ylabel('ag_009')
plt.show()

# In[ ]:

# In[139]:

sns.boxplot(y=df_filtered['ai_000'])
plt.title('Box plot of ai_000')
plt.ylabel('ai_000')
plt.show()

# In[140]:

sns.boxplot(y=df_filtered['aj_000'])
plt.title('Box plot of aj_000')
plt.ylabel('aj_000')
plt.show()

# In[141]:

sns.boxplot(y=df_filtered['ay_005'])
plt.title('Box plot of ay_005')
plt.ylabel('ay_005')
plt.show()

# In[142]:

sns.boxplot(y=df_filtered['az_003'])
plt.title('Box plot of az_003')
plt.ylabel('az_003')
plt.show()

# In[143]:

sns.boxplot(y=df_filtered['az_004'])
plt.title('Box plot of az_004')
plt.ylabel('az_004')
plt.show()

# In[145]:

sns.boxplot(y=df_filtered['cj_000'])
plt.title('Box plot of cj_000')
plt.ylabel('cj_000')
plt.show()

# In[146]:

sns.boxplot(y=df_filtered['cn_000'])
plt.title('Box plot of cn_000')
plt.ylabel('cn_000')
plt.show()

# In[147]:

sns.boxplot(y=df_filtered['cn_006'])
plt.title('Box plot of cn_006')
plt.ylabel('cn_006')
plt.show()

# In[148]:

sns.boxplot(y=df_filtered['cn_009'])
plt.title('Box plot of cn_009')
plt.ylabel('cn_009')
plt.show()

# In[149]:

sns.boxplot(y=df_filtered['cs_008'])
plt.title('Box plot of cs_008')
plt.ylabel('cs_008')
plt.show()

# In[150]:

sns.boxplot(y=df_filtered['ee_006'])
plt.title('Box plot of ee_006')
plt.ylabel('ee_006')
plt.show()

# In[151]:

sns.boxplot(y=df_filtered['ee_007'])
plt.title('Box plot of ee_007')
plt.ylabel('ee_007')
plt.show()

# In[152]:

import seaborn as sns
import matplotlib.pyplot as plt

to_plot = ['aa_000', 'ac_000', 'ag_001', 'ag_002', 'ag_003', 'ag_004', 'ag_005', 'ag_006', 'ag_007', 'ag_008', 'ah_000', 'ee_008', 'ee_009', 'eg_000']
for col in to_plot:
    plt.figure(figsize=(10, 6))
    sns.boxplot(y=df_filtered[col])
    plt.title(f'Boxplot for {col}')
    plt.show()

# During this process, I realised there was a way to run the columns in bulk, so I plotted half of them separately and the other half in bulk.

# Reference: https://seaborn.pydata.org/tutorial/categorical.html
# https://realpython.com/python-enumerate/
# https://www.freecodecamp.org/news/python-for-loop-for-i-in-range-example/
# https://matplotlib.org/stable/tutorials/pyplot.html

# During the data exploration phase I noticed that certain features showed little variation in their boxplots. When features have little variability they don't contribute much to distinguishing between data points; including them can make the model unnecessarily complex and noisy, potentially causing overfitting. Overfitting occurs when a model performs well on training data but poorly on unseen data. Moreover, removing these low-variance features can improve the efficiency of the model training process and result in a more understandable model. Considering all these factors, and aiming to keep the model simple and effective, I decided to exclude these features from the dataset. It is for these reasons that I dropped the columns listed in 'columns_to_drop' below.
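# The visual screening above can be cross-checked numerically; a minimal sketch (illustrative only, not used to drive the drops) that counts distinct values per numeric column, so near-constant features stand out at the top:

# In[ ]:

# Number of distinct values per numeric column; near-constant features appear first.
numeric_part = df_filtered.select_dtypes(include='number')
print(numeric_part.nunique().sort_values().head(15))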
# In[153]:

df_filtered.shape

# In[154]:

columns_to_drop = ['cs_008', 'cn_009', 'cn_000', 'az_003', 'aj_000', 'ac_000', 'ag_001', 'ee_009']
df_filtered = df_filtered.drop(columns=columns_to_drop)

# In[155]:

df_filtered.shape

# Reference: https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/

# In[156]:

# Pairs of features whose absolute correlation exceeds 0.99 (self-correlations of 1.0 excluded).
high_corr_pairs = correlation_matrix.stack().where(lambda x: abs(x) > 0.99).dropna()
high_corr_pairs = high_corr_pairs[high_corr_pairs < 1.0]
print(high_corr_pairs)

# In[157]:

columns_to_drop = ['bt_000', 'bu_000', 'bv_000', 'cq_000', 'cc_000']
df_filtered = df_filtered.drop(columns=columns_to_drop)

# In[158]:

correlation_matrix = df_filtered.corr()
high_corr_pairs = correlation_matrix.stack().where(lambda x: abs(x) > 0.85).dropna()
high_corr_pairs = high_corr_pairs[high_corr_pairs < 1.0]
print(high_corr_pairs)

# In[159]:

columns_to_drop = [
    'ci_000', 'af_000', 'cn_001', 'ag_004', 'cn_002', 'ag_007', 'an_000', 'am_0', 'ao_000', 'aq_000',
    'ba_003', 'ba_004', 'bg_000', 'bh_000', 'bm_000', 'bn_000', 'bp_000', 'bq_000', 'by_000', 'cm_000',
    'cn_003', 'cs_001', 'cn_004', 'dp_000', 'dt_000', 'ee_003', 'ee_004', 'ed_000'
]
df_filtered = df_filtered.drop(columns=columns_to_drop)

# I still have too many features, so I will run a correlation matrix again and try to reduce the number further.

# In[160]:

df_filtered.shape

# In[161]:

correlation_matrix = df_filtered.corr()
high_corr_pairs = correlation_matrix.stack().where(lambda x: abs(x) > 0.70).dropna()
high_corr_pairs = high_corr_pairs[high_corr_pairs < 1.0]
print(high_corr_pairs)

# In[163]:

columns_to_drop = [
    'ah_000', 'ea_000', 'ag_003', 'ag_005', 'ag_006', 'ah_000', 'al_000', 'bb_000', 'ap_000', 'az_001',
    'ba_001', 'bx_000', 'ba_009', 'bb_000', 'bc_000', 'bj_000', 'bk_000', 'bo_000', 'cb_000', 'dn_000',
    'dr_000', 'du_000', 'ec_00', 'ee_001', 'ee_002', 'ee_005'
]
df_filtered = df_filtered.drop(columns=columns_to_drop, errors='ignore')

# After lowering the threshold several times, I have removed the highly correlated pairs and have been able to reduce the dimensionality with this method.

# In[164]:

remaining_columns = df_filtered.columns.tolist()
remaining_columns

# CatPlots

# In[165]:

df_filtered.shape

# In[166]:

correlation_matrix = df_filtered.corr()
high_corr_pairs = correlation_matrix.stack().where(lambda x: abs(x) > 0.69).dropna()
high_corr_pairs = high_corr_pairs[high_corr_pairs < 1.0]
print(high_corr_pairs)

# In[168]:

columns_to_drop = [
    'at_000', 'ax_000', 'ay_001', 'ay_003', 'ay_004', 'az_002', 'az_008', 'ba_000', 'ba_002', 'ba_005',
    'cn_005', 'cn_007', 'cs_005', 'cs_007', 'do_000', 'ds_000', 'dv_000', 'ee_000', 'ee_008', 'ef_000'
]
df_filtered = df_filtered.drop(columns=columns_to_drop, errors='ignore')

# In[169]:

df_filtered.shape

# I continue this process of slightly lowering the threshold and removing columns. It would be easier to just lower the threshold from the start, but I tried this and it got rid of nearly all the features, so I decided on this more careful approach.

# In[170]:

remaining_columns = df_filtered.columns.tolist()
remaining_columns

# In[ ]:

# Catplots of Remaining Columns

# In[172]:

import seaborn as sns
import matplotlib.pyplot as plt

for column in df_filtered.columns:
    if column != 'class':
        sns.catplot(data=df_filtered, y=column)
        plt.title(f'Cat plot of {column}')
        plt.ylabel('Value')
        plt.show()

# Above are catplots for the remaining features. These were created for the following reasons:
# - To view the data density
# - To analyse the distribution across categories

# Reference: https://seaborn.pydata.org/generated/seaborn.catplot.html
# https://www.kdnuggets.com/2023/03/beginner-guide-pandas-melt-function.html
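# Looking back, the repeated lower-the-threshold-and-drop passes above could be wrapped in a small helper that, for a given cutoff, suggests one column from each highly correlated pair; a minimal sketch (illustrative only, the helper name is my own and it does not modify df_filtered):

# In[ ]:

def suggest_correlated_drops(frame, threshold):
    """Return one column name from each pair whose absolute correlation exceeds the threshold."""
    corr = frame.select_dtypes(include='number').corr().abs()
    to_drop = set()
    for i, col_a in enumerate(corr.columns):
        for col_b in corr.columns[i + 1:]:
            if corr.loc[col_a, col_b] > threshold and col_a not in to_drop and col_b not in to_drop:
                to_drop.add(col_b)
    return sorted(to_drop)

print(suggest_correlated_drops(df_filtered, 0.85))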
# In[173]:

columns_to_drop = ['class']
df_filtered = df_filtered.drop(columns=columns_to_drop)

# During the process of refining the dataset I decided to exclude the 'class' column. I have already filtered out the irrelevant negative-class rows, and keeping this column caused complications during the imputation stage. This situation highlights a problem in analysing high-dimensional data known as the "curse of dimensionality": certain dimensions not only fail to provide information but also make the analysis more confusing or computationally inefficient. Therefore it is important to eliminate such features to ensure the accuracy and efficiency of the analysis.

# In[174]:

from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Impute remaining missing values with the column mean, then plot the cumulative explained variance of a full PCA.
imputer = SimpleImputer(strategy='mean')
df_filtered_imputed = imputer.fit_transform(df_filtered)
df_filtered = pd.DataFrame(df_filtered_imputed, columns=df_filtered.columns)

pca = PCA().fit(df_filtered)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

# References: https://www.datacamp.com/tutorial/principal-component-analysis-in-python
# https://builtin.com/machine-learning/pca-in-python
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
# https://vitalflux.com/pca-explained-variance-concept-python-example/#:~:text=Explained%20variance%20represents%20the%20information,(eigenvector)%20with%20total%20eigenvalues.

# In[79]:

explained_variance = pca.explained_variance_
print(explained_variance)

# In[80]:

reduced_data = pca.transform(df_filtered)

# In[81]:

# Explained variance values taken from the PCA fit above, used to find how many components cover 99% of the variance.
variance_explained = [1.76510204e+17, 6.93057097e+16, 2.63385047e+14, 5.11764353e+13, 3.79200168e+13,
                      1.74403418e+13, 8.99232451e+12, 5.81681106e+12, 2.08572039e+12, 4.20811794e+11,
                      1.89148995e+11, 1.47912688e+11, 2.45330535e+10, 7.14234536e+08, 2.38068320e+08,
                      3.03715210e+07, 2.43266648e+07, 3.18523105e+06, 3.18583481e+02]
cumulative_variance = np.cumsum(variance_explained) / np.sum(variance_explained)
num_components_99 = np.where(cumulative_variance >= 0.99)[0][0] + 1
num_components_99

# In the context of Principal Component Analysis (PCA), and to address the challenge of dealing with high-dimensional data, my goal was to capture a significant portion of the variance in the data while reducing its dimensionality. Specifically, I aimed to encapsulate 99% of the variance in the dataset. To determine how many principal components were needed for this task, I conducted the following analysis:
#
# The array "variance_explained" records how much variance each component contributes, with values such as [1.76510204e+17, 6.93057097e+16, ..., 0.00000000e+00]. These values give insight into the proportion of variance contributed by each component.
#
# To evaluate this as a running total, I calculated "cumulative_variance" by summing the individual variances with np.cumsum().
#
# I then normalised this cumulative sum by dividing it by np.sum(variance_explained), so that it can be read as a percentage of the total variance.
#
# The crucial step was determining the number of components required to account for at least 99% of the total variance. This was achieved with np.where(cumulative_variance >= 0.99), which identifies the components satisfying this condition.
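# As an aside, scikit-learn can perform this selection directly: passing a fraction to n_components asks PCA to keep just enough components to reach that share of variance. A minimal sketch (illustrative only, using the imputed frame from above):

# In[ ]:

# PCA with a target explained-variance fraction instead of a fixed component count.
pca_99 = PCA(n_components=0.99, svd_solver='full')
pca_99.fit(df_filtered)
print(pca_99.n_components_)  # number of components needed to reach 99% of the variance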
# After going through this process, the number of components needed to reach the 99% variance threshold is determined as follows: first I calculated the cumulative variance and located the index where it reaches or exceeds 0.99; then, because indexing starts at zero, I added 1.
#
# I found that only two components are required to reach the desired threshold of 99% variance.

# Reference: https://www.analyticsvidhya.com/blog/2016/03/pca-practical-guide-principal-component-analysis-python/

# To address the issue known as the "curse of dimensionality" in Principal Component Analysis (PCA), my goal was to capture a large portion of the dataset's variance while reducing its dimensionality; I aimed to capture 99% of the variance.
#
# After analysing the data I discovered that achieving this 99% threshold only required two components. This reduction highlights how effective PCA is at reducing dimensionality, especially when dealing with high-dimensional data.
#
# To accomplish this I used the PCA implementation provided by the sklearn library. Here are the steps I followed:
#
# 1. Library Imports: I imported the PCA algorithm from sklearn.decomposition and used the pandas library for data manipulation.
#
# 2. PCA Initialisation: I initialised PCA with the chosen number of components, in this case 2.
#
# 3. Applying PCA: Next I applied PCA to the dataset.
#
# 4. Data Formatting: To simplify analysis and visualisation I transformed the output of PCA into a DataFrame, with columns labelled according to their corresponding component.
#
# 5. Viewing Transformed Data: Finally I displayed a preview of the first rows of the PCA-transformed data.
#
# The result demonstrates how the initial data has been transformed into two components that capture most of the data's variability while greatly reducing its complexity.

# In[82]:

from sklearn.decomposition import PCA
import pandas as pd

pca = PCA(n_components=2)
df_pca_transformed = pca.fit_transform(df_filtered)
df_pca_transformed = pd.DataFrame(df_pca_transformed, columns=[f'PC_{i+1}' for i in range(2)])
print(df_pca_transformed.head())

# PCA Plots

# PC1 and PC2

# In[85]:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
sns.scatterplot(x="PC_1", y="PC_2", data=df_pca_transformed)
plt.title("2D Visualization of PCA-Transformed Data")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()

# In[84]:

print(df_pca_transformed)

# Reference: https://machinelearningmastery.com/principal-component-analysis-for-visualization/
# PCA Lecture notes
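# To double-check the claim above, the share of variance captured by the two retained components can be read straight from the fitted model; a minimal sketch (the dominance of PC_1 reflects the very different scales of the raw count features):

# In[ ]:

# Per-component and total explained-variance share for the 2-component fit.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())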
# The Curse Of Dimensionality Report
#
# Dealing with the complexity of data is a common challenge in the field of data science. When we work with data that has many dimensions we often encounter what is called the "Curse of Dimensionality". This term, coined by Richard E. Bellman, refers to the difficulties that arise when analysing and organising data in spaces with a large number of dimensions. Unlike the three-dimensional physical world, these complexities are unique to high-dimensional datasets.
#
# During my preparation for data analysis using Jupyter Notebook I faced this curse head on. The dataset I loaded from 'aps_failure_set.csv' consisted of 171 columns, each representing a feature. Cleaning the data by identifying and removing columns with many missing values was not only about ensuring data quality; it was also a strategic approach to addressing the challenges posed by high dimensionality.
#
# Fortunately there are techniques to tackle dimensionality, such as Principal Component Analysis (PCA), the IQR and the correlation matrix. By applying these I transformed correlated data into a smaller set of variables known as principal components. These key components capture the essence of the data and can greatly reduce its dimensionality while retaining most of the information (source: Analytics Vidhya, 2016).
#
# Many different fields face the challenges associated with dimensionality; numerical analysis, machine learning, data mining and database management/engineering are a few examples (source: KDnuggets, 2023). One common problem that arises as the number of dimensions increases is the expansion of the feature space, which leads to sparse data. To obtain reliable results a significant amount of data is often required, and this requirement grows exponentially as dimensionality increases, a pattern I observed in my notebook when certain columns became increasingly sparse. I used a cumulative explained variance chart to help me decide how many components to keep.
#
# As datasets become more complex and we collect ever bigger data sets, careful analysis is necessary to maintain performance in data analytics (source: McKinney, 2018). This represents the downside of 'big data'. The curse of dimensionality is closely related to phenomena like the peaking (or Hughes) phenomenon: initially, adding more dimensions can enhance a classifier's effectiveness, but the benefit eventually diminishes. Striking a balance between the number of features and their cumulative discriminatory effect becomes crucial.
#
# When dealing with high-dimensional data it is also important to consider how distance functions behave (Machine Learning Mastery, n.d.). In high-dimensional scenarios the differences in distances between pairs of points can become insignificant, which makes tasks like clustering or classification more complicated. However, not all high-dimensional situations are challenging; interestingly, there are effects known as the "blessing of dimensionality" that can help address high-dimensional problems.
#
# To summarise, the 'Curse of Dimensionality' presents both challenges and opportunities. As a data analyst I can make use of strategies for preparing the data and techniques for reducing dimensionality. By doing so I can harness the potential of high-dimensional data and confidently navigate through any noise present (Built In, n.d.).
#
# Description of the Dataset
#
# The dataset being examined (sourced from 'aps_failure_set.csv') has some key points to consider:
#
# Size: The dataset consists of a large number of observations, which indicates its scale (W3Schools, n.d.).
# Attributes: Multiple columns represent features or dimensions; each column corresponds to an attribute of the dataset (Stack Overflow, 2014).
# Missing Values: Initially there were columns in the dataset that contained missing values. The exact number and extent of these missing values were determined during the data preparation phase (Stack Overflow, 2021; Pandas Documentation, n.d.).
# Observations: The rows in the dataset represent individual observations that provide insight into the data at hand.
#
# Application of Data Preparation/Evaluation Methods
#
# Thorough preprocessing was performed on the data to ensure its suitability for analysis.
#
# Important steps taken included:
#
# Cleaning: Columns with a large number of missing values were marked for removal (Pandas Documentation, n.d.). This not only assists in creating a cleaner dataset but also tackles the challenge posed by high dimensionality (Datagy, n.d.).
#
# We utilised visualisation techniques such as catplots, box plots and scatter plots to conduct Exploratory Data Analysis (EDA). These visual representations allowed us to gain insights into the distribution of the dataset, enabling us to identify trends, outliers and patterns. I used these techniques to help me remove features that I deemed unnecessary.
#
# In order to address the challenge of handling high-dimensional data we employed Principal Component Analysis (PCA) along with the IQR. Here's how it worked:
#
# Determining Variance: Initially we applied PCA to determine the number of components needed to retain 99% of the variance within the data.
#
# Reducing Dimensionality: Once this determination was made, we implemented PCA to reduce the dataset's dimensions while preserving most of its information.
#
# This approach was driven by the 'Curse of Dimensionality'. By reducing dimensions we made the data more manageable and mitigated the potential risks associated with high-dimensional datasets. The 'Curse of Dimensionality' refers to the challenges that arise when dealing with high-dimensional data: as the number of features or attributes increases, the volume of the space occupied by the data grows exponentially. This can make it difficult to analyse the data effectively and may lead companies to believe that their datasets are not useful.
#
# In relation to the APS failures dataset, increasing the number of dimensions presents challenges such as reduced accuracy, overfitting and higher computational costs for businesses.
#
# Key Takeaways:
#
# The dataset named 'aps_failure_set.csv' is quite extensive, containing 171 features. However, by preprocessing the data, removing features carrying little information, and applying techniques like Principal Component Analysis (PCA) and the IQR (Interquartile Range), we were able to extract its essential information while handling these difficulties. This process has provided insights into overcoming the obstacles posed by the 'Curse of Dimensionality', emphasises the important steps that data scientists must take for accurate and reliable analyses, and highlights how crucial the preprocessing stage is.
#
# Key Steps and Findings:
#
# Initial Data Exploration: To gain an understanding of what is contained within the dataset I initially examined rows from both ends (beginning and end). During this analysis we noticed characteristics in the data that may have been redundant or carried little information, and we also came across missing values that required our attention (Pandas Documentation, n.d.).
#
# To clean the data we focused on failures related to the APS and filtered out observations associated with the 'negative' class. To ensure a consistent analysis we replaced non-numeric placeholders with 'np.nan' (Stack Overflow, 2021). Furthermore, we discarded features that had a high percentage of missing values (GeeksforGeeks, n.d.) to prevent any misleading information.
#
# Consistency in data types across all features was crucial for analysis and modelling purposes (Datagy, n.d.). To achieve this consistency we transformed 'object' type columns into their numeric data types.
#
# To gain insights into the distribution and relationships within the data we incorporated visualisations such as boxplots into our analysis (Seaborn Documentation, n.d.).
#
# Throughout this process we encountered challenges when dealing with non-numeric data points in the dataset. Our approach involved converting these values into a numeric format while effectively managing missing values. Furthermore, the sheer abundance of features posed another obstacle in determining which ones truly offered insights.
#
# Taking into account the context of predicting failures in the Scania trucks' APS system, every step taken in this analysis carries significance. Having clean, pertinent and consistent data is essential.
#
# By considering the features and their quality, this analysis sets the groundwork with the potential to reduce costs, improve safety and optimise investments for the haulage company.
#
# Thorough data preparation ensures that only valuable information is fed into any given model, establishing a basis for actionable outcomes.
#
# Conclusions of my Data Analysis
#
# During my exploration of the data analysis process, I encountered the challenges posed by high-dimensional data. To overcome these challenges and avoid pitfalls, I decided to employ Principal Component Analysis (PCA), a technique used to address what's known as the "Curse of Dimensionality" (DataCamp, n.d.; Built In, n.d.).
#
# Applying PCA:
# To tackle the issue of dimensionality while retaining information, I utilised the PCA technique provided by the sklearn library (Scikit-learn Documentation, n.d.). My goal was to reduce the dataset to its two main components. After performing the PCA transformation on the dataset, I obtained a dataframe called "df_pca_transformed" with two columns: PC_1 and PC_2. These columns represent the first and second principal components respectively (Analytics Vidhya, 2016).
#
# Key Findings:
# Upon examining the transformed data, it becomes evident that the principal components span a range of values, encompassing both negative and positive numbers. This indicates that PCA has successfully captured variance within our dataset (Vitalflux, n.d.). Furthermore, it is noticeable that PC_1 generally exhibits larger magnitudes than PC_2. This outcome aligns with expectations, since the first principal component is intended to capture as much variance as possible within the data (Machine Learning Mastery, n.d.). Some of the data points in PC_1 have values around 10 to the power of 9, suggesting that there might be outliers or influential points in the dataset.
#
# Suggestions for Further Analysis
#
# Outlier Analysis: Given the range of values observed in the components, it would be beneficial to conduct an analysis specifically focused on identifying any outliers (Pandas Documentation, n.d.).
#
# Further Analysis: With dimensionality now reduced, further analyses such as clustering or classification can be conducted more efficiently (GeeksforGeeks, n.d.).
#
# In summary, applying PCA has opened up possibilities for conducting potentially more insightful analyses. The reduced dimensionality preserves essential information and showcases PCA's effectiveness in handling high-dimensional datasets (McKinney, 2018). I'm excited to explore this dataset using the foundation created by this PCA transformation and to discover further patterns and insights.

# References
#
# Sparkbyexamples. (n.d.). Pandas Filter Rows By Conditions.
# Available at: https://sparkbyexamples.com/pandas/pandas-filter-rows-by-conditions/ [Accessed 02/11/2023].
#
# Stack Overflow. (2021). How to replace NA values with np.nan file imported using pandas read_pickle.
# Available at: https://stackoverflow.com/questions/66043989/how-to-replace-na-values-with-np-nan-file-imported-using-pandas-read-pickle [Accessed 02/11/2023].
#
# Pandas Documentation. (n.d.). pandas.DataFrame.select_dtypes.
# Available at: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html [Accessed 02/11/2023].
#
# Stack Overflow. (2014). Find all columns of dataframe in pandas whose type is float or a particular type.
# Available at: https://stackoverflow.com/questions/21720022/find-all-columns-of-dataframe-in-pandas-whose-type-is-float-or-a-particular-typ [Accessed 02/11/2023].
#
# Datagy. (n.d.). Pandas Convert Object to Float.
# Available at: https://datagy.io/pandas-convert-object-to-float/ [Accessed 02/11/2023].
#
# Pandas Documentation. (n.d.). pandas.DataFrame.dropna.
# Available at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html [Accessed 02/11/2023].
#
# W3Schools. (n.d.). Python len() Function.
# Available at: https://www.w3schools.com/python/ref_func_len.asp [Accessed 02/11/2023].
#
# Seaborn Documentation. (n.d.). Categorical plots.
# Available at: https://seaborn.pydata.org/tutorial/categorical.html [Accessed 02/11/2023].
#
# Real Python. (n.d.). Python's enumerate() Function.
# Available at: https://realpython.com/python-enumerate/ [Accessed 02/11/2023].
#
# FreeCodeCamp. (n.d.). Python for Loop and For...In: For i in Range Examples Explained.
# Available at: https://www.freecodecamp.org/news/python-for-loop-for-i-in-range-example/ [Accessed 02/11/2023].
#
# Matplotlib. (n.d.). Pyplot tutorial.
# Available at: https://matplotlib.org/stable/tutorials/pyplot.html [Accessed 02/11/2023].
#
# GeeksforGeeks. (n.d.). How to drop one or multiple columns in Pandas DataFrame?.
# Available at: https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/ [Accessed 02/11/2023].
#
# Seaborn Documentation. (n.d.). seaborn.catplot.
# Available at: https://seaborn.pydata.org/generated/seaborn.catplot.html [Accessed 02/11/2023].
#
# KDnuggets. (2023). Beginner’s Guide to Pandas Melt Function.
# Available at: https://www.kdnuggets.com/2023/03/beginner-guide-pandas-melt-function.html [Accessed 02/11/2023].
#
# DataCamp. (n.d.). Principal Component Analysis in Python.
# Available at: https://www.datacamp.com/tutorial/principal-component-analysis-in-python [Accessed 02/11/2023].
#
# Built In. (n.d.). PCA in Python.
# Available at: https://builtin.com/machine-learning/pca-in-python [Accessed 02/11/2023].
#
# Scikit-learn Documentation. (n.d.). sklearn.decomposition.PCA.
# Available at: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html [Accessed 02/11/2023].
#
# Vitalflux. (n.d.). PCA Explained Variance Concept with Python Example.
# Available at: https://vitalflux.com/pca-explained-variance-concept-python-example/ [Accessed 02/11/2023].
#
# Analytics Vidhya. (2016). A Complete Guide to Principal Component Analysis - PCA in Python.
# Available at: https://www.analyticsvidhya.com/blog/2016/03/pca-practical-guide-principal-component-analysis-python/ [Accessed 02/11/2023].
#
# Machine Learning Mastery. (n.d.). Principal Component Analysis for Visualization.
# Available at: https://machinelearningmastery.com/principal-component-analysis-for-visualization/ [Accessed 02/11/2023].
#
# McKinney, W. (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed. [Book]
#
# McQuaid, D. (2023). What is Exploratory Data Analysis? [Lecture notes] Given on 2 Oct 2023.

# In[ ]: