In this lab we're going to do simple data exploration using Pandas. Our objective is to learn basic operations that aid in visual data exploration.

First, let's import our required libraries and read in the data.

In [2]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


#loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
#/Users/briand/Desktop/ds course/datasets

loansData = pd.read_csv('/Users/briand/Desktop/ds course/datasets/loansData.csv')

The full schema can be found here, but let's use native Pandas methods to also explore the data. Although not specified above, this csv has row headers and the read_csv function implicitly knows this.

Let's take a look at the column names, in a nicely readable way:

In [8]:

print ("Column Names Are:")
print ("")

for column_name in loansData.columns.values:
    print (column_name)

print ("")
print (f"Total # of Columns = {len(loansData.columns)}")

Column Names Are:

CustNUm
Amount.Requested
Amount.Funded.By.Investors
Interest.Rate
Loan.Length
Loan.Purpose
Debt.To.Income.Ratio
State
Home.Ownership
Monthly.Income
FICO.Range
Open.CREDIT.Lines
Revolving.CREDIT.Balance
Inquiries.in.the.Last.6.Months
Employment.Length

Total # of Columns = 15

It is oftentimes helpful to take a quick glance at the first few records (when possible). This is an easy way to discover basic data flaws (i.e., all nulls, misaligned fields, etc.). We can do this in Pandas very easily.

In [96]:

loansData.head().transpose() #We transpose it so it will fit in the display window

Out[96]:

	0	1	2	3	4
CustNUm	81174	99592	80059	15825	33182
Amount.Requested	20000	19200	35000	10000	12000
Amount.Funded.By.Investors	20000	19200	35000	9975	12000
Interest.Rate	8.90%	12.12%	21.98%	9.99%	11.71%
Loan.Length	36 months	36 months	60 months	36 months	36 months
Loan.Purpose	debt_consolidation	debt_consolidation	debt_consolidation	debt_consolidation	credit_card
Debt.To.Income.Ratio	14.90%	28.36%	23.81%	14.30%	18.78%
State	SC	TX	CA	KS	NJ
Home.Ownership	MORTGAGE	MORTGAGE	MORTGAGE	MORTGAGE	RENT
Monthly.Income	6541.67	4583.33	11500	3833.33	3195
FICO.Range	735-739	715-719	690-694	695-699	695-699
Open.CREDIT.Lines	14	12	14	10	11
Revolving.CREDIT.Balance	14272	11140	21977	9346	14469
Inquiries.in.the.Last.6.Months	2	1	1	0	0
Employment.Length	< 1 year	2 years	2 years	5 years	9 years

Next, let's run a quick line to get summary statistics of the numeric fields.

In [97]:

loansData.describe().transpose()

Out[97]:

	count	mean	std	min	25%	50%	75%	max
CustNUm	2500	51631.597200	30053.345508	10.00	26523.00	50653	77580.50	104202
Amount.Requested	2500	12406.500000	7801.544872	1000.00	6000.00	10000	17000.00	35000
Amount.Funded.By.Investors	2500	12001.573236	7745.320754	-0.01	6000.00	10000	16000.00	35000
Monthly.Income	2499	5688.931321	3963.118185	588.50	3500.00	5000	6800.00	102750
Open.CREDIT.Lines	2498	10.075661	4.508644	2.00	7.00	9	13.00	38
Revolving.CREDIT.Balance	2498	15244.559648	18308.549795	0.00	5585.75	10962	18888.75	270800
Inquiries.in.the.Last.6.Months	2498	0.906325	1.231036	0.00	0.00	0	1.00	9

Let's now focus on a single column, 'Monthly.Income.' First things first, let's display the first five records of just this field.

In [98]:

#Student input line here
loansData['Monthly.Income'][0:5]

Out[98]:

0     6541.67
1     4583.33
2    11500.00
3     3833.33
4     3195.00
Name: Monthly.Income, dtype: float64

Let's try and understand the distribution of this field. We can do this using the hist() method and matplotlib.

In [99]:

plt.figure()
hist_inc = loansData['Monthly.Income'].hist()
plt.title('Histogram of Monthly Income')
plt.show()

How would you characterize the shape of this distribution? Is there anything we can do to the income variable to make the distribution more bell curved? Let's create a new column in the dataframe called 'Monthly.LogIncome' and print a histogram of it. What might be some advantages of making such a transformation?

In [100]:

#Student input line here
loansData['Monthly.LogIncome'] = np.log(loansData['Monthly.Income']) 
plt.figure()
h = loansData['Monthly.LogIncome'].hist()
plt.title('Histogram of Log(Monthly Income)')
plt.show()

Now let's answer some questions about the data.

Q1: What is the cardinality (i.e., # of distinct values) for 'Interest.Rate' and 'FICO.Range'

In [9]:

#Student input code here

print (f"Cardinality of Interest.Rate = {loansData['Interest.Rate'].describe()['unique']}")
print (f"Cardinality of FICO.Range = {loansData['FICO.Range'].describe()['unique']}")

Cardinality of Interest.Rate = 275
Cardinality of FICO.Range = 38

Q2: What data type did Pandas set fo Interest.Rate? Can we create a new field which is stored as a float?

In [102]:

#Student input code here - show the field type
type(loansData['Interest.Rate'][0])

Out[102]:

str

In the cell below, create a new field 'Interest.Rate.Num' where 'Interest.Rate' is converted to a float. Hint: this can be done in one line using the .str assessor, strip and astype methods.

In [103]:

#Student input code here - convert field to a float using the '.str' assessor
loansData['Interest.Rate.Num'] = loansData['Interest.Rate'].str.strip('%').astype(float)

Q3: Can we get a sense of the relationship between monthly income and interest rate? Use the scatter() function from Matplotlib.

In [104]:

#Student input line here
plt.scatter(loansData['Monthly.Income'], loansData['Interest.Rate.Num'])

Out[104]:

<matplotlib.collections.PathCollection at 0x10dd280d0>

Given the skewed distribution of Income, is this chart visually helpful? Let's try the Log of income instead.

In [105]:

#Student input code here
plt.scatter(loansData['Monthly.LogIncome'], loansData['Interest.Rate.Num'])

Out[105]:

<matplotlib.collections.PathCollection at 0x10f32fdd0>

Q4: What is the average interest rate for each FICO range?
Hint: use the groupby() method in Pandas.

In [106]:

#Student input code here
fico_grp = loansData[['Interest.Rate.Num', 'FICO.Range']].groupby(['FICO.Range']).mean()

In [109]:

fico_grp

Out[109]:

	Interest.Rate.Num
FICO.Range
640-644	15.212000
645-649	14.883333
650-654	15.130000
655-659	14.930000
660-664	18.492480
665-669	17.448069
670-674	16.248480
675-679	15.854699
680-684	15.126688
685-689	14.654855
690-694	14.732071
695-699	14.146863
700-704	13.356870
705-709	12.659851
710-714	12.433125
715-719	11.181935
720-724	11.035789
725-729	10.652979
730-734	9.956170
735-739	9.623538
740-744	9.592264
745-749	9.901667
750-754	8.467705
755-759	8.996087
760-764	8.627609
765-769	7.783333
770-774	6.752941
775-779	8.744091
780-784	7.589286
785-789	8.497895
790-794	7.560500
795-799	8.391538
800-804	7.638462
805-809	7.426364
810-814	8.293750
815-819	6.925000
820-824	7.900000
830-834	7.620000

In [ ]: