In this lab we're going to do simple data exploration using Pandas. Our objective is to learn basic operations that aid in visual data exploration.
First, let's import our required libraries and read in the data.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
#/Users/briand/Desktop/ds course/datasets
loansData = pd.read_csv('/Users/briand/Desktop/ds course/datasets/loansData.csv')
The full schema can be found here, but let's use native Pandas methods to also explore the data. Although not specified above, this csv has row headers and the read_csv function implicitly knows this.
Let's take a look at the column names, in a nicely readable way:
print ("Column Names Are:")
print ("")
for column_name in loansData.columns.values:
print (column_name)
print ("")
print (f"Total # of Columns = {len(loansData.columns)}")
Column Names Are: CustNUm Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months Employment.Length Total # of Columns = 15
It is oftentimes helpful to take a quick glance at the first few records (when possible). This is an easy way to discover basic data flaws (i.e., all nulls, misaligned fields, etc.). We can do this in Pandas very easily.
loansData.head().transpose() #We transpose it so it will fit in the display window
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
CustNUm | 81174 | 99592 | 80059 | 15825 | 33182 |
Amount.Requested | 20000 | 19200 | 35000 | 10000 | 12000 |
Amount.Funded.By.Investors | 20000 | 19200 | 35000 | 9975 | 12000 |
Interest.Rate | 8.90% | 12.12% | 21.98% | 9.99% | 11.71% |
Loan.Length | 36 months | 36 months | 60 months | 36 months | 36 months |
Loan.Purpose | debt_consolidation | debt_consolidation | debt_consolidation | debt_consolidation | credit_card |
Debt.To.Income.Ratio | 14.90% | 28.36% | 23.81% | 14.30% | 18.78% |
State | SC | TX | CA | KS | NJ |
Home.Ownership | MORTGAGE | MORTGAGE | MORTGAGE | MORTGAGE | RENT |
Monthly.Income | 6541.67 | 4583.33 | 11500 | 3833.33 | 3195 |
FICO.Range | 735-739 | 715-719 | 690-694 | 695-699 | 695-699 |
Open.CREDIT.Lines | 14 | 12 | 14 | 10 | 11 |
Revolving.CREDIT.Balance | 14272 | 11140 | 21977 | 9346 | 14469 |
Inquiries.in.the.Last.6.Months | 2 | 1 | 1 | 0 | 0 |
Employment.Length | < 1 year | 2 years | 2 years | 5 years | 9 years |
Next, let's run a quick line to get summary statistics of the numeric fields.
loansData.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
CustNUm | 2500 | 51631.597200 | 30053.345508 | 10.00 | 26523.00 | 50653 | 77580.50 | 104202 |
Amount.Requested | 2500 | 12406.500000 | 7801.544872 | 1000.00 | 6000.00 | 10000 | 17000.00 | 35000 |
Amount.Funded.By.Investors | 2500 | 12001.573236 | 7745.320754 | -0.01 | 6000.00 | 10000 | 16000.00 | 35000 |
Monthly.Income | 2499 | 5688.931321 | 3963.118185 | 588.50 | 3500.00 | 5000 | 6800.00 | 102750 |
Open.CREDIT.Lines | 2498 | 10.075661 | 4.508644 | 2.00 | 7.00 | 9 | 13.00 | 38 |
Revolving.CREDIT.Balance | 2498 | 15244.559648 | 18308.549795 | 0.00 | 5585.75 | 10962 | 18888.75 | 270800 |
Inquiries.in.the.Last.6.Months | 2498 | 0.906325 | 1.231036 | 0.00 | 0.00 | 0 | 1.00 | 9 |
Let's now focus on a single column, 'Monthly.Income.' First things first, let's display the first five records of just this field.
#Student input line here
loansData['Monthly.Income'][0:5]
0 6541.67 1 4583.33 2 11500.00 3 3833.33 4 3195.00 Name: Monthly.Income, dtype: float64
Let's try and understand the distribution of this field. We can do this using the hist() method and matplotlib.
plt.figure()
hist_inc = loansData['Monthly.Income'].hist()
plt.title('Histogram of Monthly Income')
plt.show()
How would you characterize the shape of this distribution? Is there anything we can do to the income variable to make the distribution more bell curved? Let's create a new column in the dataframe called 'Monthly.LogIncome' and print a histogram of it. What might be some advantages of making such a transformation?
#Student input line here
loansData['Monthly.LogIncome'] = np.log(loansData['Monthly.Income'])
plt.figure()
h = loansData['Monthly.LogIncome'].hist()
plt.title('Histogram of Log(Monthly Income)')
plt.show()
Now let's answer some questions about the data.
Q1: What is the cardinality (i.e., # of distinct values) for 'Interest.Rate' and 'FICO.Range'
#Student input code here
print (f"Cardinality of Interest.Rate = {loansData['Interest.Rate'].describe()['unique']}")
print (f"Cardinality of FICO.Range = {loansData['FICO.Range'].describe()['unique']}")
Cardinality of Interest.Rate = 275 Cardinality of FICO.Range = 38
Q2: What data type did Pandas set fo Interest.Rate? Can we create a new field which is stored as a float?
#Student input code here - show the field type
type(loansData['Interest.Rate'][0])
str
In the cell below, create a new field 'Interest.Rate.Num' where 'Interest.Rate' is converted to a float. Hint: this can be done in one line using the .str assessor, strip and astype methods.
#Student input code here - convert field to a float using the '.str' assessor
loansData['Interest.Rate.Num'] = loansData['Interest.Rate'].str.strip('%').astype(float)
Q3: Can we get a sense of the relationship between monthly income and interest rate? Use the scatter() function from Matplotlib.
#Student input line here
plt.scatter(loansData['Monthly.Income'], loansData['Interest.Rate.Num'])
<matplotlib.collections.PathCollection at 0x10dd280d0>
Given the skewed distribution of Income, is this chart visually helpful? Let's try the Log of income instead.
#Student input code here
plt.scatter(loansData['Monthly.LogIncome'], loansData['Interest.Rate.Num'])
<matplotlib.collections.PathCollection at 0x10f32fdd0>
Q4: What is the average interest rate for each FICO range?
Hint: use the groupby() method in Pandas.
#Student input code here
fico_grp = loansData[['Interest.Rate.Num', 'FICO.Range']].groupby(['FICO.Range']).mean()
fico_grp
Interest.Rate.Num | |
---|---|
FICO.Range | |
640-644 | 15.212000 |
645-649 | 14.883333 |
650-654 | 15.130000 |
655-659 | 14.930000 |
660-664 | 18.492480 |
665-669 | 17.448069 |
670-674 | 16.248480 |
675-679 | 15.854699 |
680-684 | 15.126688 |
685-689 | 14.654855 |
690-694 | 14.732071 |
695-699 | 14.146863 |
700-704 | 13.356870 |
705-709 | 12.659851 |
710-714 | 12.433125 |
715-719 | 11.181935 |
720-724 | 11.035789 |
725-729 | 10.652979 |
730-734 | 9.956170 |
735-739 | 9.623538 |
740-744 | 9.592264 |
745-749 | 9.901667 |
750-754 | 8.467705 |
755-759 | 8.996087 |
760-764 | 8.627609 |
765-769 | 7.783333 |
770-774 | 6.752941 |
775-779 | 8.744091 |
780-784 | 7.589286 |
785-789 | 8.497895 |
790-794 | 7.560500 |
795-799 | 8.391538 |
800-804 | 7.638462 |
805-809 | 7.426364 |
810-814 | 8.293750 |
815-819 | 6.925000 |
820-824 | 7.900000 |
830-834 | 7.620000 |