# Pandas and NumPy to deal with data
import pandas as pd
import numpy as np
# Import the required module from sklearn to perform PCA
from sklearn.decomposition import PCA
Let's read the dataset companies.csv and analyze it.
df = pd.read_csv("companies.csv")
df.head()
| | employees | revenue_usd |
|---|---|---|
| 0 | 554.0 | 1443509.0 |
| 1 | 1401.0 | 3378243.0 |
| 2 | 1411.0 | 3300592.0 |
| 3 | 1415.0 | 3448365.0 |
| 4 | 825.0 | 1984168.0 |
Let's visualize both of these columns
# we'll use column INDEX (0 and 1) instead of names ("employees" and "revenue_usd") because it's shorter to type!
ax = df.plot.scatter(x=0,y=1, title="Company Data")
# Note the units: employees are counts while revenue is in US dollars,
# so the two axes are on wildly different scales. Forcing equal axis
# units with the line below would flatten the plot:
# ax.axis('equal')
We can see from this data that a company's revenue is strongly related to the number of employees it has: revenue tends to increase as the number of employees increases.
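We can put a number on "strongly related" with the Pearson correlation coefficient. A minimal sketch, using a small synthetic stand-in for companies.csv (the exact value for the real dataset will differ):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: revenue grows roughly linearly with headcount,
# plus some noise.
rng = np.random.default_rng(0)
employees = rng.integers(100, 2000, size=200)
revenue = employees * 2400 + rng.normal(0, 100_000, size=200)
df_demo = pd.DataFrame({"employees": employees, "revenue_usd": revenue})

# Series.corr computes the Pearson correlation between two columns;
# values near 1 indicate a strong positive linear relationship.
corr = df_demo["employees"].corr(df_demo["revenue_usd"])
print(f"Pearson correlation: {corr:.3f}")
```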
# Import the required module from sklearn for data normalization
from sklearn.preprocessing import StandardScaler
# StandardScaler is used for data normalization
normalize = StandardScaler()
# We define a StandardScaler and then we fit it to our data
normalize.fit(df)
# After running the fit method, the normalize object will have a mean_ and a scale_ (standard deviation) attribute
print("Mean of the data is:", normalize.mean_)
print("Standard Deviation of the data is:", normalize.scale_)
# Now we can standardize the data using the transform method.
numpy_norm = normalize.transform(df)
# .transform returns a NumPy array, which we then convert into a Pandas DataFrame.
df_norm = pd.DataFrame(numpy_norm, columns=["employees_norm","revenue_usd_norm"])
df_norm.head()
Mean of the data is: [1.1071506e+03 2.6392633e+06]
Standard Deviation of the data is: [5.02513947e+02 1.21773397e+06]
| | employees_norm | revenue_usd_norm |
|---|---|---|
| 0 | -1.100767 | -0.981950 |
| 1 | 0.584759 | 0.606848 |
| 2 | 0.604659 | 0.543081 |
| 3 | 0.612619 | 0.664432 |
| 4 | -0.561478 | -0.537963 |
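Under the hood, StandardScaler simply subtracts each column's mean and divides by its standard deviation. A sketch verifying this by hand, using the five rows shown in the table above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({
    "employees":   [554.0, 1401.0, 1411.0, 1415.0, 825.0],
    "revenue_usd": [1443509.0, 3378243.0, 3300592.0, 3448365.0, 1984168.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df_demo)

# Same computation by hand: (x - mean) / std, using the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

print(np.allclose(scaled, manual))
```

Note that pandas' `.std()` defaults to `ddof=1` (sample standard deviation), so `ddof=0` is needed for the two results to match exactly.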
Now, let's visualize the normalized data
df_norm.plot.scatter(x=0, y=1, title="Normalized Data");
PCA on the normalized data
pca = PCA(n_components=1) # since this is a 2-D dataset, we can only reduce it to 1-D!
pca.fit(df_norm);
data_pca = pca.transform(df_norm)
print("The original data has shape", df.shape )
print("The transformed data has shape", data_pca.shape )
The original data has shape (923, 2)
The transformed data has shape (923, 1)
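PCA's transform step is just centering followed by projection onto the principal axes stored in the fitted model's components_ attribute. A sketch demonstrating this on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data with strongly correlated columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]

pca = PCA(n_components=1)
Z = pca.fit_transform(X)

# transform(X) is equivalent to (X - mean_) @ components_.T:
# subtract the per-column mean, then project onto the principal axes.
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))
print(Z.shape)
```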
By running PCA on our data, we have reduced the number of dimensions! Let's see how we can get back the original number of dimensions by doing an inverse transformation.
data_inv = pca.inverse_transform(data_pca)
print("The inverse transformed data has shape", data_inv.shape)
print("This is the same as the shape of the original data!")
# Convert the inverse transformed data into a dataframe
df_norm_inv = pd.DataFrame(data_inv, columns=df_norm.columns)
The inverse transformed data has shape (923, 2)
This is the same as the shape of the original data!
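inverse_transform reverses the projection by mapping the scores back through the principal axes and adding the mean. The result has the original shape, but it lies on the first principal axis, so it only approximates the original points. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
X[:, 1] = X[:, 0] + 0.05 * X[:, 1]  # nearly collinear columns

pca = PCA(n_components=1)
Z = pca.fit_transform(X)
X_inv = pca.inverse_transform(Z)

# Same reconstruction by hand: project back and un-center.
X_manual = Z @ pca.components_ + pca.mean_
print(np.allclose(X_inv, X_manual))

# The reconstruction is close to X but not identical: the variance
# along the dropped second principal axis is lost.
print(np.allclose(X_inv, X))
```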
Aside: Here we plot two plots on top of each other. This is done by drawing both on the same axes: we capture the axes of the first plot as ax1 and pass it to the second plot with ax=ax1.

Aside: The alpha argument controls transparency. The first plot is lighter and the second much darker because we have passed alpha=.2 for the first plot and alpha=1 for the second.
ax1 = df_norm.plot.scatter(x=0, y=1, alpha=.2, color='r', label="Original Data (norm)")
df_norm_inv.plot.scatter(x=0, y=1, alpha=1, ax=ax1, label="PCA Projected Data (norm)");
# plot back in the original coordinates!
df_inv = pd.DataFrame(normalize.inverse_transform(df_norm_inv), columns=df.columns)
ax2 = df.plot.scatter(x=0, y=1, alpha=.2, color='r', label="Original Data")
df_inv.plot.scatter(x=0, y=1, alpha=1, ax=ax2, label="PCA Projected Data");
We can see from the above plot that PCA has done a great job reducing the number of dimensions of the data. The information along the least important principal axis is removed, leaving only the component of the data with the highest variance.
pca.explained_variance_ratio_
array([0.99035046])
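explained_variance_ratio_ reports, for each kept component, the fraction of the total variance that component captures; across all components the fractions sum to 1, and sklearn sorts them from largest to smallest. A quick sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))

# Keep all 3 components so the ratios account for all the variance.
pca = PCA(n_components=3).fit(X)
ratios = pca.explained_variance_ratio_

print(ratios)                          # sorted largest to smallest
print(np.isclose(ratios.sum(), 1.0))  # all components together explain everything
```

In our companies example, the single kept component explains about 99% of the variance, which is why the projection above loses so little information.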
Read the data in companies.csv. Fit a StandardScaler on both the employees and revenue_usd variables. Now we have a data point for a new company with employees = 1020, revenue_usd = 300321. Find the normalized data point and choose it from the options given.
# read the data
dfq1 = pd.read_csv("companies.csv")
# fit with StandardScaler to normalize
# Make the new datapoint into a new 1-row dataframe and then transform!
Read the data in companies_extended.csv. Now run PCA on this dataset, first with 4 components, and then with 2 components. Is the first Principal Component the same for both of these runs?
### read the data
dfq2 = pd.read_csv("companies_extended.csv")
### fit PCA with 2 components and transform
### fit PCA with 4 components and transform
### Check if the first PC is the same
Read in companies_extended.csv. Reduce the number of features/dimensions to 2, and then apply the inverse transform to get the data back into the original number of dimensions. What is the mean value of the reconstructed employees feature (the first column)? Round the value to two decimal places.

NOTE: Skip data normalization for this question.
### read the data
dfq3 = pd.read_csv("companies_extended.csv")
### use PCA to reduce the data from 5-D to 2-D
### now use inverse transform to get the data back in 5-D
### now report the mean of employee column (The first column)