Using Pandas to Get Familiar With Your Data¶

The first thing you'll want to do is familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool that modern data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the command

In [1]:

import pandas as pd

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data. Let's start by looking at a basic data overview with our example data from Melbourne and the data you'll be working with from Iowa.

We load and explore the data with the following:

In [2]:

# save filepath to variable for easier access
file_path = 'train.csv'
# read the data and store data in DataFrame titled df
df = pd.read_csv(file_path) 
# show the data
df

Out[2]:

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	12	2008	WD	Normal	250000
5	6	50	RL	85.0	14115	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	MnPrv	Shed	700	10	2009	WD	Normal	143000
6	7	20	RL	75.0	10084	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	8	2007	WD	Normal	307000
7	8	60	RL	NaN	10382	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	Shed	350	11	2009	WD	Normal	200000
8	9	50	RM	51.0	6120	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	4	2008	WD	Abnorml	129900
9	10	190	RL	50.0	7420	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	1	2008	WD	Normal	118000
10	11	20	RL	70.0	11200	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	2	2008	WD	Normal	129500
11	12	60	RL	85.0	11924	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	7	2006	New	Partial	345000
12	13	20	RL	NaN	12968	Pave	NaN	IR2	Lvl	AllPub	...	0	NaN	NaN	NaN	0	9	2008	WD	Normal	144000
13	14	20	RL	91.0	10652	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	8	2007	New	Partial	279500
14	15	20	RL	NaN	10920	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	GdWo	NaN	0	5	2008	WD	Normal	157000
15	16	45	RM	51.0	6120	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	GdPrv	NaN	0	7	2007	WD	Normal	132000
16	17	20	RL	NaN	11241	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	Shed	700	3	2010	WD	Normal	149000
17	18	90	RL	72.0	10791	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	Shed	500	10	2006	WD	Normal	90000
18	19	20	RL	66.0	13695	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	6	2008	WD	Normal	159000
19	20	20	RL	70.0	7560	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	MnPrv	NaN	0	5	2009	COD	Abnorml	139000
20	21	60	RL	101.0	14215	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	11	2006	New	Partial	325300
21	22	45	RM	57.0	7449	Pave	Grvl	Reg	Bnk	AllPub	...	0	NaN	GdPrv	NaN	0	6	2007	WD	Normal	139400
22	23	20	RL	75.0	9742	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	9	2008	WD	Normal	230000
23	24	120	RM	44.0	4224	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	6	2007	WD	Normal	129900
24	25	20	RL	NaN	8246	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	MnPrv	NaN	0	5	2010	WD	Normal	154000
25	26	20	RL	110.0	14230	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	7	2009	WD	Normal	256300
26	27	20	RL	60.0	7200	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2010	WD	Normal	134800
27	28	20	RL	98.0	11478	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2010	WD	Normal	306000
28	29	20	RL	47.0	16321	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	12	2006	WD	Normal	207500
29	30	30	RM	60.0	6324	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2008	WD	Normal	68500
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1430	1431	60	RL	60.0	21930	Pave	NaN	IR3	Lvl	AllPub	...	0	NaN	NaN	NaN	0	7	2006	WD	Normal	192140
1431	1432	120	RL	NaN	4928	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	10	2009	WD	Normal	143750
1432	1433	30	RL	60.0	10800	Pave	Grvl	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	8	2007	WD	Normal	64500
1433	1434	60	RL	93.0	10261	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2008	WD	Normal	186500
1434	1435	20	RL	80.0	17400	Pave	NaN	Reg	Low	AllPub	...	0	NaN	NaN	NaN	0	5	2006	WD	Normal	160000
1435	1436	20	RL	80.0	8400	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	GdPrv	NaN	0	7	2008	COD	Abnorml	174000
1436	1437	20	RL	60.0	9000	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	GdWo	NaN	0	5	2007	WD	Normal	120500
1437	1438	20	RL	96.0	12444	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	11	2008	New	Partial	394617
1438	1439	20	RM	90.0	7407	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	MnPrv	NaN	0	4	2010	WD	Normal	149700
1439	1440	60	RL	80.0	11584	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	11	2007	WD	Normal	197000
1440	1441	70	RL	79.0	11526	Pave	NaN	IR1	Bnk	AllPub	...	0	NaN	NaN	NaN	0	9	2008	WD	Normal	191000
1441	1442	120	RM	NaN	4426	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2008	WD	Normal	149300
1442	1443	60	FV	85.0	11003	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	4	2009	WD	Normal	310000
1443	1444	30	RL	NaN	8854	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2009	WD	Normal	121000
1444	1445	20	RL	63.0	8500	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	11	2007	WD	Normal	179600
1445	1446	85	RL	70.0	8400	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2007	WD	Normal	129000
1446	1447	20	RL	NaN	26142	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	4	2010	WD	Normal	157900
1447	1448	60	RL	80.0	10000	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	12	2007	WD	Normal	240000
1448	1449	50	RL	70.0	11767	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	GdWo	NaN	0	5	2007	WD	Normal	112000
1449	1450	180	RM	21.0	1533	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	8	2006	WD	Abnorml	92000
1450	1451	90	RL	60.0	9000	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	9	2009	WD	Normal	136000
1451	1452	20	RL	78.0	9262	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2009	New	Partial	287090
1452	1453	180	RM	35.0	3675	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	5	2006	WD	Normal	145000
1453	1454	20	RL	90.0	17217	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	7	2006	WD	Abnorml	84500
1454	1455	20	FV	62.0	7500	Pave	Pave	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	10	2009	WD	Normal	185000
1455	1456	60	RL	62.0	7917	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	8	2007	WD	Normal	175000
1456	1457	20	RL	85.0	13175	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	MnPrv	NaN	0	2	2010	WD	Normal	210000
1457	1458	70	RL	66.0	9042	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	GdPrv	Shed	2500	5	2010	WD	Normal	266500
1458	1459	20	RL	68.0	9717	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	4	2010	WD	Normal	142125
1459	1460	20	RL	75.0	9937	Pave	NaN	Reg	Lvl	AllPub	...	0	NaN	NaN	NaN	0	6	2008	WD	Normal	147500

1460 rows × 81 columns

In [3]:

# print a summary of the data in train data
print(df.describe())

                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726   
std       1.112799    30.202904     20.645407   181.066207   456.098091   
min       1.000000  1872.000000   1950.000000     0.000000     0.000000   
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000   
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000   
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000   
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000   

           ...         WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  \
count      ...        1460.000000  1460.000000    1460.000000  1460.000000   
mean       ...          94.244521    46.660274      21.954110     3.409589   
std        ...         125.338794    66.256028      61.119149    29.317331   
min        ...           0.000000     0.000000       0.000000     0.000000   
25%        ...           0.000000     0.000000       0.000000     0.000000   
50%        ...           0.000000    25.000000       0.000000     0.000000   
75%        ...         168.000000    68.000000       0.000000     0.000000   
max        ...         857.000000   547.000000     552.000000   508.000000   

       ScreenPorch     PoolArea       MiscVal       MoSold       YrSold  \
count  1460.000000  1460.000000   1460.000000  1460.000000  1460.000000   
mean     15.060959     2.758904     43.489041     6.321918  2007.815753   
std      55.757415    40.177307    496.123024     2.703626     1.328095   
min       0.000000     0.000000      0.000000     1.000000  2006.000000   
25%       0.000000     0.000000      0.000000     5.000000  2007.000000   
50%       0.000000     0.000000      0.000000     6.000000  2008.000000   
75%       0.000000     0.000000      0.000000     8.000000  2009.000000   
max     480.000000   738.000000  15500.000000    12.000000  2010.000000   

           SalePrice  
count    1460.000000  
mean   180921.195890  
std     79442.502883  
min     34900.000000  
25%    129975.000000  
50%    163000.000000  
75%    214000.000000  
max    755000.000000  

[8 rows x 38 columns]

Interpreting Data Description¶

The results show 8 numbers for each column in your original dataset. The first number, the count, shows how many rows have non-missing values.

The second value is the mean, which is the average. Under that, sd is the standard deviation, which measures how numerically spread out the values are.

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analgously, and the max is the largest number.

Using Spark and Optimus to Get Familiar With Your Data¶

Optimus (By Iron)is the missing framework for cleaning, pre-processing and exploring data in a distributed fashion. It uses all the power of Apache Spark (optimized via Catalyst) and Python to do so. It implements several handy tools for data wrangling and munging that will make your life much easier. The first obvious advantage over any other public data cleaning library or framework is that it will work on your laptop or your big cluster, and second, it is amazingly easy to install, use and understand.

Install optimus with:¶

In [4]:

!pip install optimuspyspark

Requirement already satisfied: optimuspyspark in /Users/faviovazquez/anaconda/lib/python3.6/site-packages
Requirement already satisfied: pytest in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: pixiedust-optimus in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: spark-df-profiling-optimus in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: pytest-spark in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: seaborn in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: findspark in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: pyspark in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: py>=1.4.29 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from pytest->optimuspyspark)
Requirement already satisfied: setuptools in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from pytest->optimuspyspark)
Requirement already satisfied: geojson in /Users/faviovazquez/.local/lib/python3.6/site-packages (from pixiedust-optimus->optimuspyspark)
Requirement already satisfied: lxml in /Users/faviovazquez/.local/lib/python3.6/site-packages (from pixiedust-optimus->optimuspyspark)
Requirement already satisfied: mpld3 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from pixiedust-optimus->optimuspyspark)
Requirement already satisfied: jinja2>=2.8 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: pandas>=0.17.0 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: matplotlib>=2.0 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: six>=1.9.0 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: numpy in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from seaborn->optimuspyspark)
Requirement already satisfied: scipy in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from seaborn->optimuspyspark)
Requirement already satisfied: py4j==0.10.4 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from pyspark->optimuspyspark)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from jinja2>=2.8->spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: python-dateutil>=2 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from pandas>=0.17.0->spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: pytz>=2011k in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from pandas>=0.17.0->spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: cycler>=0.10 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from matplotlib>=2.0->spark-df-profiling-optimus->optimuspyspark)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from matplotlib>=2.0->spark-df-profiling-optimus->optimuspyspark)

Using Optimus is really easy, and you have spark underneath :)

In [5]:

import optimus as op

Starting or getting SparkSession and SparkContext.

Setting checkpoint folder (local). If you are in a cluster change it with set_check_point_folder(path,'hadoop').

Deleting previous folder if exists...
Creation of checkpoint directory...
Done.

Optimus successfully imported. Have fun :).

In [6]:

tools = op.Utilities()

In [7]:

df = tools.read_csv("train.csv")

In [8]:

df.show()

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition2|BldgType|HouseStyle|OverallQual|OverallCond|YearBuilt|YearRemodAdd|RoofStyle|RoofMatl|Exterior1st|Exterior2nd|MasVnrType|MasVnrArea|ExterQual|ExterCond|Foundation|BsmtQual|BsmtCond|BsmtExposure|BsmtFinType1|BsmtFinSF1|BsmtFinType2|BsmtFinSF2|BsmtUnfSF|TotalBsmtSF|Heating|HeatingQC|CentralAir|Electrical|1stFlrSF|2ndFlrSF|LowQualFinSF|GrLivArea|BsmtFullBath|BsmtHalfBath|FullBath|HalfBath|BedroomAbvGr|KitchenAbvGr|KitchenQual|TotRmsAbvGrd|Functional|Fireplaces|FireplaceQu|GarageType|GarageYrBlt|GarageFinish|GarageCars|GarageArea|GarageQual|GarageCond|PavedDrive|WoodDeckSF|OpenPorchSF|EnclosedPorch|3SsnPorch|ScreenPorch|PoolArea|PoolQC|Fence|MiscFeature|MiscVal|MoSold|YrSold|SaleType|SaleCondition|SalePrice|
+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
|  1|        60|      RL|         65|   8450|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|     CollgCr|      Norm|      Norm|    1Fam|    2Story|          7|          5|     2003|        2003|    Gable| CompShg|    VinylSd|    VinylSd|   BrkFace|       196|       Gd|       TA|     PConc|      Gd|      TA|          No|         GLQ|       706|         Unf|         0|      150|        856|   GasA|       Ex|         Y|     SBrkr|     856|     854|           0|     1710|           1|           0|       2|       1|           3|           1|         Gd|           8|       Typ|         0|         NA|    Attchd|       2003|         RFn|         2|       548|        TA|        TA|         Y|         0|         61|            0|        0|          0|       0|    NA|   NA|         NA|      0|     2|  2008|      WD|       Normal|   208500|
|  2|        20|      RL|         80|   9600|  Pave|   NA|     Reg|        Lvl|   AllPub|      FR2|      Gtl|     Veenker|     Feedr|      Norm|    1Fam|    1Story|          6|          8|     1976|        1976|    Gable| CompShg|    MetalSd|    MetalSd|      None|         0|       TA|       TA|    CBlock|      Gd|      TA|          Gd|         ALQ|       978|         Unf|         0|      284|       1262|   GasA|       Ex|         Y|     SBrkr|    1262|       0|           0|     1262|           0|           1|       2|       0|           3|           1|         TA|           6|       Typ|         1|         TA|    Attchd|       1976|         RFn|         2|       460|        TA|        TA|         Y|       298|          0|            0|        0|          0|       0|    NA|   NA|         NA|      0|     5|  2007|      WD|       Normal|   181500|
|  3|        60|      RL|         68|  11250|  Pave|   NA|     IR1|        Lvl|   AllPub|   Inside|      Gtl|     CollgCr|      Norm|      Norm|    1Fam|    2Story|          7|          5|     2001|        2002|    Gable| CompShg|    VinylSd|    VinylSd|   BrkFace|       162|       Gd|       TA|     PConc|      Gd|      TA|          Mn|         GLQ|       486|         Unf|         0|      434|        920|   GasA|       Ex|         Y|     SBrkr|     920|     866|           0|     1786|           1|           0|       2|       1|           3|           1|         Gd|           6|       Typ|         1|         TA|    Attchd|       2001|         RFn|         2|       608|        TA|        TA|         Y|         0|         42|            0|        0|          0|       0|    NA|   NA|         NA|      0|     9|  2008|      WD|       Normal|   223500|
|  4|        70|      RL|         60|   9550|  Pave|   NA|     IR1|        Lvl|   AllPub|   Corner|      Gtl|     Crawfor|      Norm|      Norm|    1Fam|    2Story|          7|          5|     1915|        1970|    Gable| CompShg|    Wd Sdng|    Wd Shng|      None|         0|       TA|       TA|    BrkTil|      TA|      Gd|          No|         ALQ|       216|         Unf|         0|      540|        756|   GasA|       Gd|         Y|     SBrkr|     961|     756|           0|     1717|           1|           0|       1|       0|           3|           1|         Gd|           7|       Typ|         1|         Gd|    Detchd|       1998|         Unf|         3|       642|        TA|        TA|         Y|         0|         35|          272|        0|          0|       0|    NA|   NA|         NA|      0|     2|  2006|      WD|      Abnorml|   140000|
|  5|        60|      RL|         84|  14260|  Pave|   NA|     IR1|        Lvl|   AllPub|      FR2|      Gtl|     NoRidge|      Norm|      Norm|    1Fam|    2Story|          8|          5|     2000|        2000|    Gable| CompShg|    VinylSd|    VinylSd|   BrkFace|       350|       Gd|       TA|     PConc|      Gd|      TA|          Av|         GLQ|       655|         Unf|         0|      490|       1145|   GasA|       Ex|         Y|     SBrkr|    1145|    1053|           0|     2198|           1|           0|       2|       1|           4|           1|         Gd|           9|       Typ|         1|         TA|    Attchd|       2000|         RFn|         3|       836|        TA|        TA|         Y|       192|         84|            0|        0|          0|       0|    NA|   NA|         NA|      0|    12|  2008|      WD|       Normal|   250000|
|  6|        50|      RL|         85|  14115|  Pave|   NA|     IR1|        Lvl|   AllPub|   Inside|      Gtl|     Mitchel|      Norm|      Norm|    1Fam|    1.5Fin|          5|          5|     1993|        1995|    Gable| CompShg|    VinylSd|    VinylSd|      None|         0|       TA|       TA|      Wood|      Gd|      TA|          No|         GLQ|       732|         Unf|         0|       64|        796|   GasA|       Ex|         Y|     SBrkr|     796|     566|           0|     1362|           1|           0|       1|       1|           1|           1|         TA|           5|       Typ|         0|         NA|    Attchd|       1993|         Unf|         2|       480|        TA|        TA|         Y|        40|         30|            0|      320|          0|       0|    NA|MnPrv|       Shed|    700|    10|  2009|      WD|       Normal|   143000|
|  7|        20|      RL|         75|  10084|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|     Somerst|      Norm|      Norm|    1Fam|    1Story|          8|          5|     2004|        2005|    Gable| CompShg|    VinylSd|    VinylSd|     Stone|       186|       Gd|       TA|     PConc|      Ex|      TA|          Av|         GLQ|      1369|         Unf|         0|      317|       1686|   GasA|       Ex|         Y|     SBrkr|    1694|       0|           0|     1694|           1|           0|       2|       0|           3|           1|         Gd|           7|       Typ|         1|         Gd|    Attchd|       2004|         RFn|         2|       636|        TA|        TA|         Y|       255|         57|            0|        0|          0|       0|    NA|   NA|         NA|      0|     8|  2007|      WD|       Normal|   307000|
|  8|        60|      RL|         NA|  10382|  Pave|   NA|     IR1|        Lvl|   AllPub|   Corner|      Gtl|      NWAmes|      PosN|      Norm|    1Fam|    2Story|          7|          6|     1973|        1973|    Gable| CompShg|    HdBoard|    HdBoard|     Stone|       240|       TA|       TA|    CBlock|      Gd|      TA|          Mn|         ALQ|       859|         BLQ|        32|      216|       1107|   GasA|       Ex|         Y|     SBrkr|    1107|     983|           0|     2090|           1|           0|       2|       1|           3|           1|         TA|           7|       Typ|         2|         TA|    Attchd|       1973|         RFn|         2|       484|        TA|        TA|         Y|       235|        204|          228|        0|          0|       0|    NA|   NA|       Shed|    350|    11|  2009|      WD|       Normal|   200000|
|  9|        50|      RM|         51|   6120|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|     OldTown|    Artery|      Norm|    1Fam|    1.5Fin|          7|          5|     1931|        1950|    Gable| CompShg|    BrkFace|    Wd Shng|      None|         0|       TA|       TA|    BrkTil|      TA|      TA|          No|         Unf|         0|         Unf|         0|      952|        952|   GasA|       Gd|         Y|     FuseF|    1022|     752|           0|     1774|           0|           0|       2|       0|           2|           2|         TA|           8|      Min1|         2|         TA|    Detchd|       1931|         Unf|         2|       468|        Fa|        TA|         Y|        90|          0|          205|        0|          0|       0|    NA|   NA|         NA|      0|     4|  2008|      WD|      Abnorml|   129900|
| 10|       190|      RL|         50|   7420|  Pave|   NA|     Reg|        Lvl|   AllPub|   Corner|      Gtl|     BrkSide|    Artery|    Artery|  2fmCon|    1.5Unf|          5|          6|     1939|        1950|    Gable| CompShg|    MetalSd|    MetalSd|      None|         0|       TA|       TA|    BrkTil|      TA|      TA|          No|         GLQ|       851|         Unf|         0|      140|        991|   GasA|       Ex|         Y|     SBrkr|    1077|       0|           0|     1077|           1|           0|       1|       0|           2|           2|         TA|           5|       Typ|         2|         TA|    Attchd|       1939|         RFn|         1|       205|        Gd|        TA|         Y|         0|          4|            0|        0|          0|       0|    NA|   NA|         NA|      0|     1|  2008|      WD|       Normal|   118000|
| 11|        20|      RL|         70|  11200|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|      Sawyer|      Norm|      Norm|    1Fam|    1Story|          5|          5|     1965|        1965|      Hip| CompShg|    HdBoard|    HdBoard|      None|         0|       TA|       TA|    CBlock|      TA|      TA|          No|         Rec|       906|         Unf|         0|      134|       1040|   GasA|       Ex|         Y|     SBrkr|    1040|       0|           0|     1040|           1|           0|       1|       0|           3|           1|         TA|           5|       Typ|         0|         NA|    Detchd|       1965|         Unf|         1|       384|        TA|        TA|         Y|         0|          0|            0|        0|          0|       0|    NA|   NA|         NA|      0|     2|  2008|      WD|       Normal|   129500|
| 12|        60|      RL|         85|  11924|  Pave|   NA|     IR1|        Lvl|   AllPub|   Inside|      Gtl|     NridgHt|      Norm|      Norm|    1Fam|    2Story|          9|          5|     2005|        2006|      Hip| CompShg|    WdShing|    Wd Shng|     Stone|       286|       Ex|       TA|     PConc|      Ex|      TA|          No|         GLQ|       998|         Unf|         0|      177|       1175|   GasA|       Ex|         Y|     SBrkr|    1182|    1142|           0|     2324|           1|           0|       3|       0|           4|           1|         Ex|          11|       Typ|         2|         Gd|   BuiltIn|       2005|         Fin|         3|       736|        TA|        TA|         Y|       147|         21|            0|        0|          0|       0|    NA|   NA|         NA|      0|     7|  2006|     New|      Partial|   345000|
| 13|        20|      RL|         NA|  12968|  Pave|   NA|     IR2|        Lvl|   AllPub|   Inside|      Gtl|      Sawyer|      Norm|      Norm|    1Fam|    1Story|          5|          6|     1962|        1962|      Hip| CompShg|    HdBoard|    Plywood|      None|         0|       TA|       TA|    CBlock|      TA|      TA|          No|         ALQ|       737|         Unf|         0|      175|        912|   GasA|       TA|         Y|     SBrkr|     912|       0|           0|      912|           1|           0|       1|       0|           2|           1|         TA|           4|       Typ|         0|         NA|    Detchd|       1962|         Unf|         1|       352|        TA|        TA|         Y|       140|          0|            0|        0|        176|       0|    NA|   NA|         NA|      0|     9|  2008|      WD|       Normal|   144000|
| 14|        20|      RL|         91|  10652|  Pave|   NA|     IR1|        Lvl|   AllPub|   Inside|      Gtl|     CollgCr|      Norm|      Norm|    1Fam|    1Story|          7|          5|     2006|        2007|    Gable| CompShg|    VinylSd|    VinylSd|     Stone|       306|       Gd|       TA|     PConc|      Gd|      TA|          Av|         Unf|         0|         Unf|         0|     1494|       1494|   GasA|       Ex|         Y|     SBrkr|    1494|       0|           0|     1494|           0|           0|       2|       0|           3|           1|         Gd|           7|       Typ|         1|         Gd|    Attchd|       2006|         RFn|         3|       840|        TA|        TA|         Y|       160|         33|            0|        0|          0|       0|    NA|   NA|         NA|      0|     8|  2007|     New|      Partial|   279500|
| 15|        20|      RL|         NA|  10920|  Pave|   NA|     IR1|        Lvl|   AllPub|   Corner|      Gtl|       NAmes|      Norm|      Norm|    1Fam|    1Story|          6|          5|     1960|        1960|      Hip| CompShg|    MetalSd|    MetalSd|   BrkFace|       212|       TA|       TA|    CBlock|      TA|      TA|          No|         BLQ|       733|         Unf|         0|      520|       1253|   GasA|       TA|         Y|     SBrkr|    1253|       0|           0|     1253|           1|           0|       1|       1|           2|           1|         TA|           5|       Typ|         1|         Fa|    Attchd|       1960|         RFn|         1|       352|        TA|        TA|         Y|         0|        213|          176|        0|          0|       0|    NA| GdWo|         NA|      0|     5|  2008|      WD|       Normal|   157000|
| 16|        45|      RM|         51|   6120|  Pave|   NA|     Reg|        Lvl|   AllPub|   Corner|      Gtl|     BrkSide|      Norm|      Norm|    1Fam|    1.5Unf|          7|          8|     1929|        2001|    Gable| CompShg|    Wd Sdng|    Wd Sdng|      None|         0|       TA|       TA|    BrkTil|      TA|      TA|          No|         Unf|         0|         Unf|         0|      832|        832|   GasA|       Ex|         Y|     FuseA|     854|       0|           0|      854|           0|           0|       1|       0|           2|           1|         TA|           5|       Typ|         0|         NA|    Detchd|       1991|         Unf|         2|       576|        TA|        TA|         Y|        48|        112|            0|        0|          0|       0|    NA|GdPrv|         NA|      0|     7|  2007|      WD|       Normal|   132000|
| 17|        20|      RL|         NA|  11241|  Pave|   NA|     IR1|        Lvl|   AllPub|  CulDSac|      Gtl|       NAmes|      Norm|      Norm|    1Fam|    1Story|          6|          7|     1970|        1970|    Gable| CompShg|    Wd Sdng|    Wd Sdng|   BrkFace|       180|       TA|       TA|    CBlock|      TA|      TA|          No|         ALQ|       578|         Unf|         0|      426|       1004|   GasA|       Ex|         Y|     SBrkr|    1004|       0|           0|     1004|           1|           0|       1|       0|           2|           1|         TA|           5|       Typ|         1|         TA|    Attchd|       1970|         Fin|         2|       480|        TA|        TA|         Y|         0|          0|            0|        0|          0|       0|    NA|   NA|       Shed|    700|     3|  2010|      WD|       Normal|   149000|
| 18|        90|      RL|         72|  10791|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|      Sawyer|      Norm|      Norm|  Duplex|    1Story|          4|          5|     1967|        1967|    Gable| CompShg|    MetalSd|    MetalSd|      None|         0|       TA|       TA|      Slab|      NA|      NA|          NA|          NA|         0|          NA|         0|        0|          0|   GasA|       TA|         Y|     SBrkr|    1296|       0|           0|     1296|           0|           0|       2|       0|           2|           2|         TA|           6|       Typ|         0|         NA|   CarPort|       1967|         Unf|         2|       516|        TA|        TA|         Y|         0|          0|            0|        0|          0|       0|    NA|   NA|       Shed|    500|    10|  2006|      WD|       Normal|    90000|
| 19|        20|      RL|         66|  13695|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|     SawyerW|      RRAe|      Norm|    1Fam|    1Story|          5|          5|     2004|        2004|    Gable| CompShg|    VinylSd|    VinylSd|      None|         0|       TA|       TA|     PConc|      TA|      TA|          No|         GLQ|       646|         Unf|         0|      468|       1114|   GasA|       Ex|         Y|     SBrkr|    1114|       0|           0|     1114|           1|           0|       1|       1|           3|           1|         Gd|           6|       Typ|         0|         NA|    Detchd|       2004|         Unf|         2|       576|        TA|        TA|         Y|         0|        102|            0|        0|          0|       0|    NA|   NA|         NA|      0|     6|  2008|      WD|       Normal|   159000|
| 20|        20|      RL|         70|   7560|  Pave|   NA|     Reg|        Lvl|   AllPub|   Inside|      Gtl|       NAmes|      Norm|      Norm|    1Fam|    1Story|          5|          6|     1958|        1965|      Hip| CompShg|    BrkFace|    Plywood|      None|         0|       TA|       TA|    CBlock|      TA|      TA|          No|         LwQ|       504|         Unf|         0|      525|       1029|   GasA|       TA|         Y|     SBrkr|    1339|       0|           0|     1339|           0|           0|       1|       0|           3|           1|         TA|           6|      Min1|         0|         NA|    Attchd|       1958|         Unf|         1|       294|        TA|        TA|         Y|         0|          0|            0|        0|          0|       0|    NA|MnPrv|         NA|      0|     5|  2009|     COD|      Abnorml|   139000|
+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
only showing top 20 rows

In [9]:

profiler = op.DataFrameProfiler(df)

In [10]:

profiler.profiler()

Out[10]:

Overview

Dataset info

Number of variables	81
Number of observations	1460
Total Missing (%)	0.0%
Total size in memory	0.0 B
Average record size in memory	0.0 B

Variables types

Numeric	35
Categorical	46
Date	0
Text (Unique)	0
Rejected	0

Warnings

2ndFlrSF has 829 / 56.8% zeros
3SsnPorch has 1436 / 98.4% zeros
BsmtFinSF1 has 467 / 32.0% zeros
BsmtFinSF2 has 1293 / 88.6% zeros
BsmtFullBath has 856 / 58.6% zeros
BsmtHalfBath has 1378 / 94.4% zeros
BsmtUnfSF has 118 / 8.1% zeros
EnclosedPorch has 1252 / 85.8% zeros
Fireplaces has 690 / 47.3% zeros
GarageArea has 81 / 5.5% zeros
GarageCars has 81 / 5.5% zeros
GarageYrBlt has a high cardinality: 98 distinct values Warning
HalfBath has 913 / 62.5% zeros
LotFrontage has a high cardinality: 111 distinct values Warning
LowQualFinSF has 1434 / 98.2% zeros
MasVnrArea has a high cardinality: 328 distinct values Warning
MiscVal has 1408 / 96.4% zeros
MiscVal is highly skewed (γ1 = 24.452)
OpenPorchSF has 656 / 44.9% zeros
PoolArea has 1453 / 99.5% zeros
ScreenPorch has 1344 / 92.1% zeros
TotalBsmtSF has 37 / 2.5% zeros
WoodDeckSF has 761 / 52.1% zeros

Variables

1stFlrSF
Numeric

Distinct count	753
Unique (%)	51.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1162.6
Minimum	334
Maximum	4692
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	334
5-th percentile	672.95
Q1	882
Median	1087
Q3	1391.2
95-th percentile	1831.2
Maximum	4692
Range	4358
Interquartile range	509.25

Descriptive statistics

Standard deviation	386.59
Coef of variation	0.33251
Kurtosis	5.7221
Mean	1162.6
MAD	300.58
Skewness	1.3753
Sum	1697400
Variance	149450
Memory size	0.0 B

2ndFlrSF
Numeric

Distinct count	417
Unique (%)	28.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	346.99
Minimum	0
Maximum	2065
Zeros (%)	56.8%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	728
95-th percentile	1141
Maximum	2065
Range	2065
Interquartile range	728

Descriptive statistics

Standard deviation	436.53
Coef of variation	1.258
Kurtosis	-0.55568
Mean	346.99
MAD	396.48
Skewness	0.81219
Sum	506610
Variance	190560
Memory size	0.0 B

3SsnPorch
Numeric

Distinct count	20
Unique (%)	1.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	3.4096
Minimum	0
Maximum	508
Zeros (%)	98.4%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	0
Maximum	508
Range	508
Interquartile range	0

Descriptive statistics

Standard deviation	29.317
Coef of variation	8.5985
Kurtosis	123.24
Mean	3.4096
MAD	6.7071
Skewness	10.294
Sum	4978
Variance	859.51
Memory size	0.0 B

Alley
Categorical

Distinct count	3
Unique (%)	0.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	1369
Grvl	50
Pave	41

Toggle details

Value	Count	Frequency (%)
NA	1369	93.8%
Grvl	50	3.4%
Pave	41	2.8%

BedroomAbvGr
Numeric

Distinct count	8
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	2.8664
Minimum	0
Maximum	8
Zeros (%)	0.4%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	2
Q1	2
Median	3
Q3	3
95-th percentile	4
Maximum	8
Range	8
Interquartile range	1

Descriptive statistics

Standard deviation	0.81578
Coef of variation	0.2846
Kurtosis	2.2191
Mean	2.8664
MAD	0.57631
Skewness	0.21157
Sum	4185
Variance	0.66549
Memory size	0.0 B

BldgType
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

1Fam	1220
TwnhsE	114
Duplex	52
Other values (2)	74

Toggle details

Value	Count	Frequency (%)
1Fam	1220	83.6%
TwnhsE	114	7.8%
Duplex	52	3.6%
Twnhs	43	2.9%
2fmCon	31	2.1%

BsmtCond
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	1311
Gd	65
Fa	45
Other values (2)	39

Toggle details

Value	Count	Frequency (%)
TA	1311	89.8%
Gd	65	4.5%
Fa	45	3.1%
NA	37	2.5%
Po	2	0.1%

BsmtExposure
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

No	953
Av	221
Gd	134
Other values (2)	152

Toggle details

Value	Count	Frequency (%)
No	953	65.3%
Av	221	15.1%
Gd	134	9.2%
Mn	114	7.8%
NA	38	2.6%

BsmtFinSF1
Numeric

Distinct count	637
Unique (%)	43.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	443.64
Minimum	0
Maximum	5644
Zeros (%)	32.0%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	383.5
Q3	712.25
95-th percentile	1274
Maximum	5644
Range	5644
Interquartile range	712.25

Descriptive statistics

Standard deviation	456.1
Coef of variation	1.0281
Kurtosis	11.076
Mean	443.64
MAD	367.37
Skewness	1.6838
Sum	647710
Variance	208030
Memory size	0.0 B

BsmtFinSF2
Numeric

Distinct count	144
Unique (%)	9.9%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	46.549
Minimum	0
Maximum	1474
Zeros (%)	88.6%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	396.2
Maximum	1474
Range	1474
Interquartile range	0

Descriptive statistics

Standard deviation	161.32
Coef of variation	3.4656
Kurtosis	20.04
Mean	46.549
MAD	82.535
Skewness	4.2509
Sum	67962
Variance	26024
Memory size	0.0 B

BsmtFinType1
Categorical

Distinct count	7
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Unf	430
GLQ	418
ALQ	220
Other values (4)	392

Toggle details

Value	Count	Frequency (%)
Unf	430	29.5%
GLQ	418	28.6%
ALQ	220	15.1%
BLQ	148	10.1%
Rec	133	9.1%
LwQ	74	5.1%
NA	37	2.5%

BsmtFinType2
Categorical

Distinct count	7
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Unf	1256
Rec	54
LwQ	46
Other values (4)	104

Toggle details

Value	Count	Frequency (%)
Unf	1256	86.0%
Rec	54	3.7%
LwQ	46	3.2%
NA	38	2.6%
BLQ	33	2.3%
ALQ	19	1.3%
GLQ	14	1.0%

BsmtFullBath
Numeric

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	0.42534
Minimum	0
Maximum	3
Zeros (%)	58.6%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	1
95-th percentile	1
Maximum	3
Range	3
Interquartile range	1

Descriptive statistics

Standard deviation	0.51891
Coef of variation	1.22
Kurtosis	-0.84033
Mean	0.42534
MAD	0.49876
Skewness	0.59545
Sum	621
Variance	0.26927
Memory size	0.0 B

BsmtHalfBath
Numeric

Distinct count	3
Unique (%)	0.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	0.057534
Minimum	0
Maximum	2
Zeros (%)	94.4%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	1
Maximum	2
Range	2
Interquartile range	0

Descriptive statistics

Standard deviation	0.23875
Coef of variation	4.1497
Kurtosis	16.336
Mean	0.057534
MAD	0.10861
Skewness	4.0992
Sum	84
Variance	0.057003
Memory size	0.0 B

BsmtQual
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	649
Gd	618
Ex	121
Other values (2)	72

Toggle details

Value	Count	Frequency (%)
TA	649	44.5%
Gd	618	42.3%
Ex	121	8.3%
NA	37	2.5%
Fa	35	2.4%

BsmtUnfSF
Numeric

Distinct count	780
Unique (%)	53.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	567.24
Minimum	0
Maximum	2336
Zeros (%)	8.1%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	223
Median	477.5
Q3	808
95-th percentile	1468
Maximum	2336
Range	2336
Interquartile range	585

Descriptive statistics

Standard deviation	441.87
Coef of variation	0.77898
Kurtosis	0.46926
Mean	567.24
MAD	353.28
Skewness	0.91932
Sum	828170
Variance	195250
Memory size	0.0 B

CentralAir
Categorical

Distinct count	2
Unique (%)	0.1%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Y	1365
N	95

Toggle details

Value	Count	Frequency (%)
Y	1365	93.5%
N	95	6.5%

Condition1
Categorical

Distinct count	9
Unique (%)	0.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Norm	1260
Feedr	81
Artery	48
Other values (6)	71

Toggle details

Value	Count	Frequency (%)
Norm	1260	86.3%
Feedr	81	5.5%
Artery	48	3.3%
RRAn	26	1.8%
PosN	19	1.3%
RRAe	11	0.8%
PosA	8	0.5%
RRNn	5	0.3%
RRNe	2	0.1%

Condition2
Categorical

Distinct count	8
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Norm	1445
Feedr	6
PosN	2
Other values (5)	7

Toggle details

Value	Count	Frequency (%)
Norm	1445	99.0%
Feedr	6	0.4%
PosN	2	0.1%
Artery	2	0.1%
RRNn	2	0.1%
RRAn	1	0.1%
PosA	1	0.1%
RRAe	1	0.1%

Electrical
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

SBrkr	1334
FuseA	94
FuseF	27
Other values (3)	5

Toggle details

Value	Count	Frequency (%)
SBrkr	1334	91.4%
FuseA	94	6.4%
FuseF	27	1.8%
FuseP	3	0.2%
Mix	1	0.1%
NA	1	0.1%

EnclosedPorch
Numeric

Distinct count	120
Unique (%)	8.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	21.954
Minimum	0
Maximum	552
Zeros (%)	85.8%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	180.15
Maximum	552
Range	552
Interquartile range	0

Descriptive statistics

Standard deviation	61.119
Coef of variation	2.784
Kurtosis	10.391
Mean	21.954
MAD	37.66
Skewness	3.0867
Sum	32053
Variance	3735.6
Memory size	0.0 B

ExterCond
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	1282
Gd	146
Fa	28
Other values (2)	4

Toggle details

Value	Count	Frequency (%)
TA	1282	87.8%
Gd	146	10.0%
Fa	28	1.9%
Ex	3	0.2%
Po	1	0.1%

ExterQual
Categorical

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	906
Gd	488
Ex	52

Toggle details

Value	Count	Frequency (%)
TA	906	62.1%
Gd	488	33.4%
Ex	52	3.6%
Fa	14	1.0%

Exterior1st
Categorical

Distinct count	15
Unique (%)	1.0%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

VinylSd	515
HdBoard	222
MetalSd	220
Other values (12)	503

Toggle details

Value	Count	Frequency (%)
VinylSd	515	35.3%
HdBoard	222	15.2%
MetalSd	220	15.1%
Wd Sdng	206	14.1%
Plywood	108	7.4%
CemntBd	61	4.2%
BrkFace	50	3.4%
WdShing	26	1.8%
Stucco	25	1.7%
AsbShng	20	1.4%
Stone	2	0.1%
BrkComm	2	0.1%
AsphShn	1	0.1%
ImStucc	1	0.1%
CBlock	1	0.1%

Exterior2nd
Categorical

Distinct count	16
Unique (%)	1.1%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

VinylSd	504
MetalSd	214
HdBoard	207
Other values (13)	535

Toggle details

Value	Count	Frequency (%)
VinylSd	504	34.5%
MetalSd	214	14.7%
HdBoard	207	14.2%
Wd Sdng	197	13.5%
Plywood	142	9.7%
CmentBd	60	4.1%
Wd Shng	38	2.6%
Stucco	26	1.8%
BrkFace	25	1.7%
AsbShng	20	1.4%
ImStucc	10	0.7%
Brk Cmn	7	0.5%
Stone	5	0.3%
AsphShn	3	0.2%
Other	1	0.1%
CBlock	1	0.1%

Fence
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	1179
MnPrv	157
GdPrv	59
Other values (2)	65

Toggle details

Value	Count	Frequency (%)
NA	1179	80.8%
MnPrv	157	10.8%
GdPrv	59	4.0%
GdWo	54	3.7%
MnWw	11	0.8%

FireplaceQu
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	690
Gd	380
TA	313
Other values (3)	77

Toggle details

Value	Count	Frequency (%)
NA	690	47.3%
Gd	380	26.0%
TA	313	21.4%
Fa	33	2.3%
Ex	24	1.6%
Po	20	1.4%

Fireplaces
Numeric

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	0.61301
Minimum	0
Maximum	3
Zeros (%)	47.3%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	1
Q3	1
95-th percentile	2
Maximum	3
Range	3
Interquartile range	1

Descriptive statistics

Standard deviation	0.64467
Coef of variation	1.0516
Kurtosis	-0.2206
Mean	0.61301
MAD	0.57942
Skewness	0.6489
Sum	895
Variance	0.41559
Memory size	0.0 B

Foundation
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

PConc	647
CBlock	634
BrkTil	146
Other values (3)	33

Toggle details

Value	Count	Frequency (%)
PConc	647	44.3%
CBlock	634	43.4%
BrkTil	146	10.0%
Slab	24	1.6%
Stone	6	0.4%
Wood	3	0.2%

FullBath
Numeric

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1.5651
Minimum	0
Maximum	3
Zeros (%)	0.6%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	1
Q1	1
Median	2
Q3	2
95-th percentile	2
Maximum	3
Range	3
Interquartile range	1

Descriptive statistics

Standard deviation	0.55092
Coef of variation	0.35201
Kurtosis	-0.85822
Mean	1.5651
MAD	0.52244
Skewness	0.036524
Sum	2285
Variance	0.30351
Memory size	0.0 B

Functional
Categorical

Distinct count	7
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Typ	1360
Min2	34
Min1	31
Other values (4)	35

Toggle details

Value	Count	Frequency (%)
Typ	1360	93.2%
Min2	34	2.3%
Min1	31	2.1%
Mod	15	1.0%
Maj1	14	1.0%
Maj2	5	0.3%
Sev	1	0.1%

GarageArea
Numeric

Distinct count	441
Unique (%)	30.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	472.98
Minimum	0
Maximum	1418
Zeros (%)	5.5%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	334.5
Median	480
Q3	576
95-th percentile	850.1
Maximum	1418
Range	1418
Interquartile range	241.5

Descriptive statistics

Standard deviation	213.8
Coef of variation	0.45204
Kurtosis	0.90982
Mean	472.98
MAD	160.02
Skewness	0.1798
Sum	690550
Variance	45713
Memory size	0.0 B

GarageCars
Numeric

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1.7671
Minimum	0
Maximum	4
Zeros (%)	5.5%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	1
Median	2
Q3	2
95-th percentile	3
Maximum	4
Range	4
Interquartile range	1

Descriptive statistics

Standard deviation	0.74732
Coef of variation	0.4229
Kurtosis	0.21613
Mean	1.7671
MAD	0.58384
Skewness	-0.3422
Sum	2580
Variance	0.55848
Memory size	0.0 B

GarageCond
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	1326
NA	81
Fa	35
Other values (3)	18

Toggle details

Value	Count	Frequency (%)
TA	1326	90.8%
NA	81	5.5%
Fa	35	2.4%
Gd	9	0.6%
Po	7	0.5%
Ex	2	0.1%

GarageFinish
Categorical

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Unf	605
RFn	422
Fin	352

Toggle details

Value	Count	Frequency (%)
Unf	605	41.4%
RFn	422	28.9%
Fin	352	24.1%
NA	81	5.5%

GarageQual
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	1311
NA	81
Fa	48
Other values (3)	20

Toggle details

Value	Count	Frequency (%)
TA	1311	89.8%
NA	81	5.5%
Fa	48	3.3%
Gd	14	1.0%
Po	3	0.2%
Ex	3	0.2%

GarageType
Categorical

Distinct count	7
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Attchd	870
Detchd	387
BuiltIn	88
Other values (4)	115

Toggle details

Value	Count	Frequency (%)
Attchd	870	59.6%
Detchd	387	26.5%
BuiltIn	88	6.0%
NA	81	5.5%
Basment	19	1.3%
CarPort	9	0.6%
2Types	6	0.4%

GarageYrBlt
Categorical

Distinct count	98
Unique (%)	6.7%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	81
2005	65
2006	59
Other values (95)	1255

Toggle details

Value	Count	Frequency (%)
NA	81	5.5%
2005	65	4.5%
2006	59	4.0%
2004	53	3.6%
2003	50	3.4%
2007	49	3.4%
1977	35	2.4%
1998	31	2.1%
1999	30	2.1%
1976	29	2.0%
2008	29	2.0%
2000	27	1.8%
1968	26	1.8%
2002	26	1.8%
1950	24	1.6%
1993	22	1.5%
1958	21	1.4%
1965	21	1.4%
1962	21	1.4%
2009	21	1.4%
Other values (78)	740	50.7%

GrLivArea
Numeric

Distinct count	861
Unique (%)	59.0%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1515.5
Minimum	334
Maximum	5642
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	334
5-th percentile	848
Q1	1129.5
Median	1464
Q3	1776.8
95-th percentile	2466.1
Maximum	5642
Range	5308
Interquartile range	647.25

Descriptive statistics

Standard deviation	525.48
Coef of variation	0.34675
Kurtosis	4.8743
Mean	1515.5
MAD	397.32
Skewness	1.3652
Sum	2212600
Variance	276130
Memory size	0.0 B

HalfBath
Numeric

Distinct count	3
Unique (%)	0.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	0.38288
Minimum	0
Maximum	2
Zeros (%)	62.5%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	1
95-th percentile	1
Maximum	2
Range	2
Interquartile range	1

Descriptive statistics

Standard deviation	0.50289
Coef of variation	1.3134
Kurtosis	-1.0773
Mean	0.38288
MAD	0.47886
Skewness	0.6752
Sum	559
Variance	0.25289
Memory size	0.0 B

Heating
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

GasA	1428
GasW	18
Grav	7
Other values (3)	7

Toggle details

Value	Count	Frequency (%)
GasA	1428	97.8%
GasW	18	1.2%
Grav	7	0.5%
Wall	4	0.3%
OthW	2	0.1%
Floor	1	0.1%

HeatingQC
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Ex	741
TA	428
Gd	241
Other values (2)	50

Toggle details

Value	Count	Frequency (%)
Ex	741	50.8%
TA	428	29.3%
Gd	241	16.5%
Fa	49	3.4%
Po	1	0.1%

HouseStyle
Categorical

Distinct count	8
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

1Story	726
2Story	445
1.5Fin	154
Other values (5)	135

Toggle details

Value	Count	Frequency (%)
1Story	726	49.7%
2Story	445	30.5%
1.5Fin	154	10.5%
SLvl	65	4.5%
SFoyer	37	2.5%
1.5Unf	14	1.0%
2.5Unf	11	0.8%
2.5Fin	8	0.5%

Id
Numeric

Distinct count	1460
Unique (%)	100.0%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	730.5
Minimum	1
Maximum	1460
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1
5-th percentile	73.95
Q1	365.75
Median	730.5
Q3	1095.2
95-th percentile	1387
Maximum	1460
Range	1459
Interquartile range	729.5

Descriptive statistics

Standard deviation	421.61
Coef of variation	0.57715
Kurtosis	-1.2
Mean	730.5
MAD	365
Skewness	0
Sum	1066500
Variance	177760
Memory size	0.0 B

KitchenAbvGr
Numeric

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1.0466
Minimum	0
Maximum	3
Zeros (%)	0.1%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	1
Q1	1
Median	1
Q3	1
95-th percentile	1
Maximum	3
Range	3
Interquartile range	0

Descriptive statistics

Standard deviation	0.22034
Coef of variation	0.21053
Kurtosis	21.455
Mean	1.0466
MAD	0.090246
Skewness	4.4838
Sum	1528
Variance	0.048549
Memory size	0.0 B

KitchenQual
Categorical

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

TA	735
Gd	586
Ex	100

Toggle details

Value	Count	Frequency (%)
TA	735	50.3%
Gd	586	40.1%
Ex	100	6.8%
Fa	39	2.7%

LandContour
Categorical

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Lvl	1311
Bnk	63
HLS	50

Toggle details

Value	Count	Frequency (%)
Lvl	1311	89.8%
Bnk	63	4.3%
HLS	50	3.4%
Low	36	2.5%

LandSlope
Categorical

Distinct count	3
Unique (%)	0.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Gtl	1382
Mod	65
Sev	13

Toggle details

Value	Count	Frequency (%)
Gtl	1382	94.7%
Mod	65	4.5%
Sev	13	0.9%

LotArea
Numeric

Distinct count	1073
Unique (%)	73.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	10517
Minimum	1300
Maximum	215240
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1300
5-th percentile	3311.7
Q1	7553.5
Median	9478.5
Q3	11602
95-th percentile	17401
Maximum	215240
Range	213940
Interquartile range	4048

Descriptive statistics

Standard deviation	9981.3
Coef of variation	0.94908
Kurtosis	202.54
Mean	10517
MAD	3758.8
Skewness	12.195
Sum	15355000
Variance	99626000
Memory size	0.0 B

LotConfig
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Inside	1052
Corner	263
CulDSac	94
Other values (2)	51

Toggle details

Value	Count	Frequency (%)
Inside	1052	72.1%
Corner	263	18.0%
CulDSac	94	6.4%
FR2	47	3.2%
FR3	4	0.3%

LotFrontage
Categorical

Distinct count	111
Unique (%)	7.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	259
60	143
70	70
Other values (108)	988

Toggle details

Value	Count	Frequency (%)
NA	259	17.7%
60	143	9.8%
70	70	4.8%
80	69	4.7%
50	57	3.9%
75	53	3.6%
65	44	3.0%
85	40	2.7%
78	25	1.7%
90	23	1.6%
21	23	1.6%
64	19	1.3%
68	19	1.3%
24	19	1.3%
73	18	1.2%
55	17	1.2%
79	17	1.2%
63	17	1.2%
72	17	1.2%
100	16	1.1%
Other values (91)	495	33.9%

LotShape
Categorical

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Reg	925
IR1	484
IR2	41

Toggle details

Value	Count	Frequency (%)
Reg	925	63.4%
IR1	484	33.2%
IR2	41	2.8%
IR3	10	0.7%

LowQualFinSF
Numeric

Distinct count	24
Unique (%)	1.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	5.8445
Minimum	0
Maximum	572
Zeros (%)	98.2%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	0
Maximum	572
Range	572
Interquartile range	0

Descriptive statistics

Standard deviation	48.623
Coef of variation	8.3194
Kurtosis	82.946
Mean	5.8445
MAD	11.481
Skewness	9.0021
Sum	8533
Variance	2364.2
Memory size	0.0 B

MSSubClass
Numeric

Distinct count	15
Unique (%)	1.0%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	56.897
Minimum	20
Maximum	190
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	20
5-th percentile	20
Q1	20
Median	50
Q3	70
95-th percentile	160
Maximum	190
Range	170
Interquartile range	50

Descriptive statistics

Standard deviation	42.301
Coef of variation	0.74346
Kurtosis	1.5707
Mean	56.897
MAD	31.283
Skewness	1.4062
Sum	83070
Variance	1789.3
Memory size	0.0 B

MSZoning
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

RL	1151
RM	218
FV	65
Other values (2)	26

Toggle details

Value	Count	Frequency (%)
RL	1151	78.8%
RM	218	14.9%
FV	65	4.5%
RH	16	1.1%
C (all)	10	0.7%

MasVnrArea
Categorical

Distinct count	328
Unique (%)	22.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

0	861
108	8
180	8
Other values (325)	583

Toggle details

Value	Count	Frequency (%)
0	861	59.0%
108	8	0.5%
180	8	0.5%
72	8	0.5%
NA	8	0.5%
16	7	0.5%
120	7	0.5%
200	6	0.4%
106	6	0.4%
340	6	0.4%
80	6	0.4%
170	5	0.3%
84	5	0.3%
320	5	0.3%
360	5	0.3%
132	5	0.3%
216	4	0.3%
196	4	0.3%
76	4	0.3%
210	4	0.3%
Other values (308)	488	33.4%

MasVnrType
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

None	864
BrkFace	445
Stone	128
Other values (2)	23

Toggle details

Value	Count	Frequency (%)
None	864	59.2%
BrkFace	445	30.5%
Stone	128	8.8%
BrkCmn	15	1.0%
NA	8	0.5%

MiscFeature
Categorical

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	1406
Shed	49
Gar2	2
Other values (2)	3

Toggle details

Value	Count	Frequency (%)
NA	1406	96.3%
Shed	49	3.4%
Gar2	2	0.1%
Othr	2	0.1%
TenC	1	0.1%

MiscVal
Numeric

Distinct count	21
Unique (%)	1.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	43.489
Minimum	0
Maximum	15500
Zeros (%)	96.4%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	0
Maximum	15500
Range	15500
Interquartile range	0

Descriptive statistics

Standard deviation	496.12
Coef of variation	11.408
Kurtosis	698.6
Mean	43.489
MAD	83.88
Skewness	24.452
Sum	63494
Variance	246140
Memory size	0.0 B

MoSold
Numeric

Distinct count	12
Unique (%)	0.8%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	6.3219
Minimum	1
Maximum	12
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1
5-th percentile	2
Q1	5
Median	6
Q3	8
95-th percentile	11
Maximum	12
Range	11
Interquartile range	3

Descriptive statistics

Standard deviation	2.7036
Coef of variation	0.42766
Kurtosis	-0.40683
Mean	6.3219
MAD	2.1425
Skewness	0.21184
Sum	9230
Variance	7.3096
Memory size	0.0 B

Neighborhood
Categorical

Distinct count	25
Unique (%)	1.7%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NAmes	225
CollgCr	150
OldTown	113
Other values (22)	972

Toggle details

Value	Count	Frequency (%)
NAmes	225	15.4%
CollgCr	150	10.3%
OldTown	113	7.7%
Edwards	100	6.8%
Somerst	86	5.9%
Gilbert	79	5.4%
NridgHt	77	5.3%
Sawyer	74	5.1%
NWAmes	73	5.0%
SawyerW	59	4.0%
BrkSide	58	4.0%
Crawfor	51	3.5%
Mitchel	49	3.4%
NoRidge	41	2.8%
Timber	38	2.6%
IDOTRR	37	2.5%
ClearCr	28	1.9%
StoneBr	25	1.7%
SWISU	25	1.7%
MeadowV	17	1.2%
Other values (5)	55	3.8%

OpenPorchSF
Numeric

Distinct count	202
Unique (%)	13.8%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	46.66
Minimum	0
Maximum	547
Zeros (%)	44.9%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	25
Q3	68
95-th percentile	175.05
Maximum	547
Range	547
Interquartile range	68

Descriptive statistics

Standard deviation	66.256
Coef of variation	1.42
Kurtosis	8.4572
Mean	46.66
MAD	47.678
Skewness	2.3619
Sum	68124
Variance	4389.9
Memory size	0.0 B

OverallCond
Numeric

Distinct count	9
Unique (%)	0.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	5.5753
Minimum	1
Maximum	9
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1
5-th percentile	4
Q1	5
Median	5
Q3	6
95-th percentile	8
Maximum	9
Range	8
Interquartile range	1

Descriptive statistics

Standard deviation	1.1128
Coef of variation	0.19959
Kurtosis	1.0985
Mean	5.5753
MAD	0.88902
Skewness	0.69236
Sum	8140
Variance	1.2383
Memory size	0.0 B

OverallQual
Numeric

Distinct count	10
Unique (%)	0.7%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	6.0993
Minimum	1
Maximum	10
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1
5-th percentile	4
Q1	5
Median	6
Q3	7
95-th percentile	8
Maximum	10
Range	9
Interquartile range	2

Descriptive statistics

Standard deviation	1.383
Coef of variation	0.22675
Kurtosis	0.091857
Mean	6.0993
MAD	1.098
Skewness	0.21672
Sum	8905
Variance	1.9127
Memory size	0.0 B

PavedDrive
Categorical

Distinct count	3
Unique (%)	0.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Y	1340
N	90
P	30

Toggle details

Value	Count	Frequency (%)
Y	1340	91.8%
N	90	6.2%
P	30	2.1%

PoolArea
Numeric

Distinct count	8
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	2.7589
Minimum	0
Maximum	738
Zeros (%)	99.5%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	0
Maximum	738
Range	738
Interquartile range	0

Descriptive statistics

Standard deviation	40.177
Coef of variation	14.563
Kurtosis	222.5
Mean	2.7589
MAD	5.4914
Skewness	14.813
Sum	4028
Variance	1614.2
Memory size	0.0 B

PoolQC
Categorical

Distinct count	4
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

NA	1453
Gd	3
Ex	2

Toggle details

Value	Count	Frequency (%)
NA	1453	99.5%
Gd	3	0.2%
Ex	2	0.1%
Fa	2	0.1%

RoofMatl
Categorical

Distinct count	8
Unique (%)	0.5%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

CompShg	1434
Tar&Grv	11
WdShngl	6
Other values (5)	9

Toggle details

Value	Count	Frequency (%)
CompShg	1434	98.2%
Tar&Grv	11	0.8%
WdShngl	6	0.4%
WdShake	5	0.3%
Membran	1	0.1%
ClyTile	1	0.1%
Metal	1	0.1%
Roll	1	0.1%

RoofStyle
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Gable	1141
Hip	286
Flat	13
Other values (3)	20

Toggle details

Value	Count	Frequency (%)
Gable	1141	78.2%
Hip	286	19.6%
Flat	13	0.9%
Gambrel	11	0.8%
Mansard	7	0.5%
Shed	2	0.1%

SaleCondition
Categorical

Distinct count	6
Unique (%)	0.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Normal	1198
Partial	125
Abnorml	101
Other values (3)	36

Toggle details

Value	Count	Frequency (%)
Normal	1198	82.1%
Partial	125	8.6%
Abnorml	101	6.9%
Family	20	1.4%
Alloca	12	0.8%
AdjLand	4	0.3%

SalePrice
Numeric

Distinct count	663
Unique (%)	45.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	180920
Minimum	34900
Maximum	755000
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	34900
5-th percentile	88000
Q1	129980
Median	163000
Q3	214000
95-th percentile	326100
Maximum	755000
Range	720100
Interquartile range	84025

Descriptive statistics

Standard deviation	79443
Coef of variation	0.4391
Kurtosis	6.5098
Mean	180920
MAD	57435
Skewness	1.8809
Sum	264140000
Variance	6311100000
Memory size	0.0 B

SaleType
Categorical

Distinct count	9
Unique (%)	0.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

WD	1267
New	122
COD	43
Other values (6)	28

Toggle details

Value	Count	Frequency (%)
WD	1267	86.8%
New	122	8.4%
COD	43	2.9%
ConLD	9	0.6%
ConLI	5	0.3%
ConLw	5	0.3%
CWD	4	0.3%
Oth	3	0.2%
Con	2	0.1%

ScreenPorch
Numeric

Distinct count	76
Unique (%)	5.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	15.061
Minimum	0
Maximum	480
Zeros (%)	92.1%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	0
95-th percentile	160
Maximum	480
Range	480
Interquartile range	0

Descriptive statistics

Standard deviation	55.757
Coef of variation	3.7021
Kurtosis	18.372
Mean	15.061
MAD	27.729
Skewness	4.118
Sum	21989
Variance	3108.9
Memory size	0.0 B

Street
Categorical

Distinct count	2
Unique (%)	0.1%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Pave	1454
Grvl	6

Toggle details

Value	Count	Frequency (%)
Pave	1454	99.6%
Grvl	6	0.4%

TotRmsAbvGrd
Numeric

Distinct count	12
Unique (%)	0.8%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	6.5178
Minimum	2
Maximum	14
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	2
5-th percentile	4
Q1	5
Median	6
Q3	7
95-th percentile	10
Maximum	14
Range	12
Interquartile range	2

Descriptive statistics

Standard deviation	1.6254
Coef of variation	0.24938
Kurtosis	0.87364
Mean	6.5178
MAD	1.2796
Skewness	0.67565
Sum	9516
Variance	2.6419
Memory size	0.0 B

TotalBsmtSF
Numeric

Distinct count	721
Unique (%)	49.4%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1057.4
Minimum	0
Maximum	6110
Zeros (%)	2.5%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	519.3
Q1	795.75
Median	991.5
Q3	1298.2
95-th percentile	1753
Maximum	6110
Range	6110
Interquartile range	502.5

Descriptive statistics

Standard deviation	438.71
Coef of variation	0.41488
Kurtosis	13.201
Mean	1057.4
MAD	321.28
Skewness	1.5227
Sum	1543800
Variance	192460
Memory size	0.0 B

Utilities
Categorical

Distinct count	2
Unique (%)	0.1%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

AllPub	1459
NoSeWa	1

Toggle details

Value	Count	Frequency (%)
AllPub	1459	99.9%
NoSeWa	1	0.1%

WoodDeckSF
Numeric

Distinct count	274
Unique (%)	18.8%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	94.245
Minimum	0
Maximum	857
Zeros (%)	52.1%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	168
95-th percentile	335
Maximum	857
Range	857
Interquartile range	168

Descriptive statistics

Standard deviation	125.34
Coef of variation	1.3299
Kurtosis	2.9786
Mean	94.245
MAD	102
Skewness	1.5398
Sum	137600
Variance	15710
Memory size	0.0 B

YearBuilt
Numeric

Distinct count	112
Unique (%)	7.7%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1971.3
Minimum	1872
Maximum	2010
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1872
5-th percentile	1916
Q1	1954
Median	1973
Q3	2000
95-th percentile	2007
Maximum	2010
Range	138
Interquartile range	46

Descriptive statistics

Standard deviation	30.203
Coef of variation	0.015322
Kurtosis	-0.44215
Mean	1971.3
MAD	25.067
Skewness	-0.61283
Sum	2878100
Variance	912.22
Memory size	0.0 B

YearRemodAdd
Numeric

Distinct count	61
Unique (%)	4.2%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1984.9
Minimum	1950
Maximum	2010
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	1950
5-th percentile	1950
Q1	1967
Median	1994
Q3	2004
95-th percentile	2007
Maximum	2010
Range	60
Interquartile range	37

Descriptive statistics

Standard deviation	20.645
Coef of variation	0.010401
Kurtosis	-1.272
Mean	1984.9
MAD	18.623
Skewness	-0.50304
Sum	2897900
Variance	426.23
Memory size	0.0 B

YrSold
Numeric

Distinct count	5
Unique (%)	0.3%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	2007.8
Minimum	2006
Maximum	2010
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	2006
5-th percentile	2006
Q1	2007
Median	2008
Q3	2009
95-th percentile	2010
Maximum	2010
Range	4
Interquartile range	2

Descriptive statistics

Standard deviation	1.3281
Coef of variation	0.00066146
Kurtosis	-1.1906
Mean	2007.8
MAD	1.1487
Skewness	0.09617
Sum	2931400
Variance	1.7638
Memory size	0.0 B

Sample

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65	8450	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NA	Attchd	2003	RFn	2	548	TA	TA	Y	0	61	0	NA	NA	NA	2	2008	WD	Normal	208500
1	2	20	RL	80	9600	Pave	NA	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976	RFn	2	460	TA	TA	Y	298	0	0	NA	NA	NA	5	2007	WD	Normal	181500
2	3	60	RL	68	11250	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001	RFn	2	608	TA	TA	Y	0	42	0	NA	NA	NA	9	2008	WD	Normal	223500
3	4	70	RL	60	9550	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998	Unf	3	642	TA	TA	Y	0	35	272	NA	NA	NA	2	2006	WD	Abnorml	140000
4	5	60	RL	84	14260	Pave	NA	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000	RFn	3	836	TA	TA	Y	192	84	0	NA	NA	NA	12	2008	WD	Normal	250000

This class makes a profile for a given dataframe and its different general features. Based on spark-df-profiling by Julio Soto.

This overview presents basic information about the DataFrame, like number of variable it has, how many are missing values and in which column, the types of each variable, also some statistical information that describes the variable plus a frequency plot. Also a table that specifies the existing datatypes in each column dataFrame and other features.

You can also use Spark's native describe function to get something very similar of what you got using pandas.

In [11]:

df.describe().toPandas()

Out[11]:

	summary	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	...	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	count	1460	1460	1460	1460	1460	1460	1460	1460	1460	...	1460	1460	1460	1460	1460	1460	1460	1460	1460	1460
1	mean	730.5	56.897260273972606	None	70.04995836802665	10516.828082191782	None	None	None	None	...	2.758904109589041	None	None	None	43.489041095890414	6.321917808219178	2007.8157534246575	None	None	180921.19589041095
2	stddev	421.6100093688479	42.30057099381045	None	24.28475177448321	9981.26493237915	None	None	None	None	...	40.17730694453021	None	None	None	496.1230244579441	2.7036262083595113	1.3280951205521145	None	None	79442.50288288663
3	min	1	20	C (all)	100	1300	Grvl	Grvl	IR1	Bnk	...	0	Ex	GdPrv	Gar2	0	1	2006	COD	Abnorml	34900
4	max	1460	190	RM	NA	215245	Pave	Pave	Reg	Lvl	...	738	NA	NA	TenC	15500	12	2010	WD	Partial	755000

5 rows × 82 columns

Selecting and Filtering Data using Pandas¶

Your dataset had too many variables to wrap your head around, or even to print out nicely. This is just one of many situations where you'll want to access a smaller set of your data.

For now, we'll rely on our intuition to pick variables to focus on. Later tutorials will show you statistical techniques to automatically prioritize variables.

Before we can choose columns, it is helpful to see a list of all columns in the dataset. That is done with the columns property of the DataFrame (the bottom line of code below).

In [12]:

file_path = 'train.csv'
df = pd.read_csv(file_path) 
print(df.columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

There are many ways to select a subset of your data. We'll start with two main approaches:

Tasks:¶

Print a list of the columns
From the list of columns, find a name of the column with the sales prices of the homes. Use the dot notation to extract this to a variable (as you saw above in the original notebook)
Use the head command to print out the top few lines of the variable you just created.
Pick any two variables and store them to a new DataFrame (as you saw in the original notebook when you created two_columns_of_data.)
Use the describe command with the DataFrame you just created to see summaries of those variables.

Print a list of the columns¶

In [13]:

df.columns

Out[13]:

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

From the list of columns, find a name of the column with the sales prices of the homes. Use the dot notation to extract this to a variable¶

In [14]:

house_price = df.SalePrice

Use the head command to print out the top few lines of the variable you just created.¶

In [15]:

house_price.head()

Out[15]:

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

Pick any two variables and store them to a new DataFrame¶

In [16]:

columns_of_interest = ['1stFlrSF', '2ndFlrSF']
two_columns_of_data = df[columns_of_interest]

In [17]:

two_columns_of_data

Out[17]:

	1stFlrSF	2ndFlrSF
0	856	854
1	1262	0
2	920	866
3	961	756
4	1145	1053
5	796	566
6	1694	0
7	1107	983
8	1022	752
9	1077	0
10	1040	0
11	1182	1142
12	912	0
13	1494	0
14	1253	0
15	854	0
16	1004	0
17	1296	0
18	1114	0
19	1339	0
20	1158	1218
21	1108	0
22	1795	0
23	1060	0
24	1060	0
25	1600	0
26	900	0
27	1704	0
28	1600	0
29	520	0
...	...	...
1430	734	1104
1431	958	0
1432	968	0
1433	962	830
1434	1126	0
1435	1537	0
1436	864	0
1437	1932	0
1438	1236	0
1439	1040	685
1440	1423	748
1441	848	0
1442	1026	981
1443	952	0
1444	1422	0
1445	913	0
1446	1188	0
1447	1220	870
1448	796	550
1449	630	0
1450	896	896
1451	1578	0
1452	1072	0
1453	1140	0
1454	1221	0
1455	953	694
1456	2073	0
1457	1188	1152
1458	1078	0
1459	1256	0

1460 rows × 2 columns

Use the describe command with the DataFrame you just created to see summaries of those variables.¶

In [18]:

two_columns_of_data.describe()

Out[18]:

	1stFlrSF	2ndFlrSF
count	1460.000000	1460.000000
mean	1162.626712	346.992466
std	386.587738	436.528436
min	334.000000	0.000000
25%	882.000000	0.000000
50%	1087.000000	0.000000
75%	1391.250000	728.000000
max	4692.000000	2065.000000

Selecting and Filtering Data using PySpark and Optimus¶

In [19]:

import optimus as op
tools = op.Utilities()
df = tools.read_csv("train.csv")

Print a list of the columns¶

In [20]:

df.columns

Out[20]:

['Id',
 'MSSubClass',
 'MSZoning',
 'LotFrontage',
 'LotArea',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'KitchenQual',
 'TotRmsAbvGrd',
 'Functional',
 'Fireplaces',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'MiscVal',
 'MoSold',
 'YrSold',
 'SaleType',
 'SaleCondition',
 'SalePrice']

From the list of columns, find a name of the column with the sales prices of the homes. Use the dot notation to extract this to a variable¶

This is not the same in Spark, is more SQL-like, and the dot notation exists, but it will give you a Column, not that useful for now, so let's better select it. This will create another Spark DF.

In [21]:

house_price = df.select("SalePrice")

Use the head command to print out the top few lines of the variable you just created.¶

Again not the same behavior, this will happen if you use head:

In [22]:

house_price.head()

Out[22]:

Row(SalePrice=208500)

What? I know... But if you use the show() method to see its content, if we do it with n=5 is the same as head in Pandas. If you want to know more about Rows in spark go here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row

In [23]:

house_price.show(n=5)

+---------+
|SalePrice|
+---------+
|   208500|
|   181500|
|   223500|
|   140000|
|   250000|
+---------+
only showing top 5 rows

Woo!

Pick any two variables and store them to a new DataFrame¶

We'll use again select().

In [24]:

two_columns_of_data = df.select("1stFlrSF","2ndFlrSF")

In [25]:

two_columns_of_data.show(n=2)

+--------+--------+
|1stFlrSF|2ndFlrSF|
+--------+--------+
|     856|     854|
|    1262|       0|
+--------+--------+
only showing top 2 rows

or

In [26]:

columns_of_interest = ['1stFlrSF', '2ndFlrSF']
two_columns_of_data = df.select(columns_of_interest)

In [27]:

two_columns_of_data.show(n=2)

+--------+--------+
|1stFlrSF|2ndFlrSF|
+--------+--------+
|     856|     854|
|    1262|       0|
+--------+--------+
only showing top 2 rows

Use the describe command with the DataFrame you just created to see summaries of those variables.¶

In [28]:

two_columns_of_data.describe().show()

+-------+-----------------+------------------+
|summary|         1stFlrSF|          2ndFlrSF|
+-------+-----------------+------------------+
|  count|             1460|              1460|
|   mean|1162.626712328767|346.99246575342465|
| stddev|386.5877380410744|  436.528435886257|
|    min|              334|                 0|
|    max|             4692|              2065|
+-------+-----------------+------------------+

And with the Optimus profiler (Remember clicking on Toggle Details):

In [29]:

profiler = op.DataFrameProfiler(two_columns_of_data)
profiler.profiler()

Out[29]:

Overview

Dataset info

Number of variables	2
Number of observations	1460
Total Missing (%)	0.0%
Total size in memory	0.0 B
Average record size in memory	0.0 B

Variables types

Numeric	2
Categorical	0
Date	0
Text (Unique)	0
Rejected	0

Warnings

2ndFlrSF has 829 / 56.8% zeros

Variables

1stFlrSF
Numeric

Distinct count	753
Unique (%)	51.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	1162.6
Minimum	334
Maximum	4692
Zeros (%)	0.0%

Toggle details

Quantile statistics

Minimum	334
5-th percentile	672.95
Q1	882
Median	1087
Q3	1391.2
95-th percentile	1831.2
Maximum	4692
Range	4358
Interquartile range	509.25

Descriptive statistics

Standard deviation	386.59
Coef of variation	0.33251
Kurtosis	5.7221
Mean	1162.6
MAD	300.58
Skewness	1.3753
Sum	1697400
Variance	149450
Memory size	0.0 B

2ndFlrSF
Numeric

Distinct count	417
Unique (%)	28.6%
Missing (%)	0.0%
Missing (n)	0
Infinite (%)	0.0%
Infinite (n)	0

Mean	346.99
Minimum	0
Maximum	2065
Zeros (%)	56.8%

Toggle details

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
Median	0
Q3	728
95-th percentile	1141
Maximum	2065
Range	2065
Interquartile range	728

Descriptive statistics

Standard deviation	436.53
Coef of variation	1.258
Kurtosis	-0.55568
Mean	346.99
MAD	396.48
Skewness	0.81219
Sum	506610
Variance	190560
Memory size	0.0 B

Sample

	1stFlrSF	2ndFlrSF
0	856	854
1	1262	0
2	920	866
3	961	756
4	1145	1053

Your First Scikit-Learn Model¶

You have the code to load your data, and you know how to index it. You are ready to choose which column you want to predict. This column is called the prediction target. There is a convention that the prediction target is referred to as y. Here is an example doing that with the example data.

Check the code in the original repo before running this! https://www.kaggle.com/dansbecker/your-first-scikit-learn-model

Your Turn¶

Now it's time for you to define and fit a model for your data (in your notebook).

Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.
Create a list of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):
- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd
Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.
Create a DecisionTreeRegressorModel and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.
Fit the model you have created using the data in X and the target data you saved above.
Make a few predictions with the model's predict command and print out the predictions.

1. Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.¶

The varialbe we want to predict is the SalePrice. So:

In [30]:

import pandas as pd

file_path = 'train.csv'
df = pd.read_csv(file_path) 

y = df.SalePrice

2. Create a list of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):¶

- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd

In [31]:

predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                        'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

3. Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.¶

In [32]:

X = df[predictors]

4. Create a DecisionTreeRegressorModel and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.¶

You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames. We will use Spark after :).

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like
Evaluate: Determine how accurate the model's predictions are.

In [33]:

from sklearn.tree import DecisionTreeRegressor

# Define model
my_model = DecisionTreeRegressor()

5. Fit the model you have created using the data in X and the target data you saved above.¶

In [34]:

# Fit model
my_model.fit(X, y)

Out[34]:

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

6. Make a few predictions with the model's predict command and print out the predictions.¶

In [35]:

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(my_model.predict(X.head()))

Making predictions for the following 5 houses:
   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  
The predictions are
[ 208500.  181500.  223500.  140000.  250000.]

The real prices are

In [36]:

y.head().tolist()

Out[36]:

[208500, 181500, 223500, 140000, 250000]

So it's a very good model :). Or is it?

You've built a decision tree model that can predict the prices of houses based on their characteristics. It's natural to ask how accurate the model's predictions will be, and measuring accuracy is necessary for us to see whether or not other approaches improve our model.

Your First Spark ML Model (here only Spark, next with Optimus)¶

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Persistence: saving and load algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.

“Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API. This is majorly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and the “Spark ML Pipelines” term we used initially to emphasize the pipeline concept.

The varialbe we want to predict is the SalePrice. So:

In [37]:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

file_path = 'train.csv'
df = spark.read.csv(file_path, header="true", inferSchema=True) 

In [24]:

from pyspark.ml.feature import VectorAssembler, VectorIndexer

# Choosing predictors
features_cols = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                        'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# This concatenates all feature columns into a single feature vector in a new column "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=features_cols, outputCol="raw_features")
# This identifies categorical features and indexes them.
vectorIndexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=4)

In [40]:

from pyspark.ml.regression import DecisionTreeRegressor
# Takes the "features" column and learns to predict "SalePrice"
dt = DecisionTreeRegressor(labelCol="SalePrice")

In [25]:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
        vectorAssembler, 
        vectorIndexer, 
        dt
    ])

In [42]:

model = pipeline.fit(df)

In [43]:

predictions = model.transform(df)

In [44]:

predictions.select("SalePrice","Prediction").show(5)

+---------+------------------+
|SalePrice|        Prediction|
+---------+------------------+
|   208500|219910.90697674418|
|   181500|150758.37549407114|
|   223500|219910.90697674418|
|   140000|149397.38805970148|
|   250000|300287.74358974356|
+---------+------------------+
only showing top 5 rows

So it seems not that easy with Spark, and not even that accurate, but remember this will scale, and as you will see, spark needs more tweaks to improve performance. But more on that later.

Model Validation¶

What is Model Validation¶

You've built a model. But how good is it?

You'll need to answer this question for almost every model you ever build. In most (though not necessarily all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens.

Some people try answering this problem by making predictions with their training data. They compare those predictions to the actual target values in the training data. This approach has a critical shortcoming, which you will see in a moment (and which you'll subsequently see how to solve).

Even with this simple approach, you'll need to summarize the model quality into a form that someone can understand. If you have predicted and actual home values for 10000 houses, you will inevitably end up with a mix of good and bad predictions. Looking through such a long list would be pointless.

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is: error=actual−predicted

So, if a house cost $150,000 and you predicted it would cost $100,000 the error is $50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

On average, our predictions are off by about X

In the notebook for this module they first load the Melbourne data and create X and y. We'll solve the problem first with sklearn and them with Spark :).

The Problem with "In-Sample" Scores¶

The measure they computed in the original notebook can be called an "in-sample" score. They used a single set of houses (called a data sample) for both building the model and for calculating it's MAE score. This is bad.

Imagine that, in the large real estate market, door color is unrelated to home price. However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was originally derived from the training data, the model will appear accurate in the training data.

But this pattern likely won't hold when the model sees new data, and the model would be very inaccurate (and cost us lots of money) when we applied it to our real estate business.

Even a model capturing only happenstance relationships in the data, relationships that will not be repeated when new data, can appear to be very accurate on in-sample accuracy measurements.

Models' practical value come from making predictions on new data, so we should measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model's accuracy on data it hasn't seen before. This data is called validation data.

The scikit-learn library has a function train_test_split to break up the data into two pieces. We'll see afterwards how to do this in Spark.

Your Turn¶

Use the train_test_split command to split up your data.
Fit the model with the training data
Make predictions with the validation predictors
Calculate the mean absolute error between your predictions and the actual target values for the validation data.

1. Use the train_test_split command to split up your data.¶

In [1]:

import pandas as pd

file_path = 'train.csv'
df = pd.read_csv(file_path) 

y = df.SalePrice

predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                        'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

X = df[predictors]

In [2]:

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

It should be noted that ty default, the value of the test data is set to 0.25. And of course the training will be 0.75.

2. Fit the model with the training data¶

In [3]:

from sklearn.tree import DecisionTreeRegressor

# Define model
model = DecisionTreeRegressor()
# Fit model
model.fit(train_X, train_y)

Out[3]:

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

3. Make predictions with the validation predictors¶

In [5]:

# get predicted prices on validation data
val_predictions = model.predict(val_X)

4. Calculate the mean absolute error between your predictions and the actual target values for the validation data.¶

In [6]:

from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, val_predictions))

33606.2356164

Model Validation in Spark¶

The process in spark will be really similar, we first need to build the model as we saw before, but let's start with the data splitting.

In Spark randomSplit randomly splits this DataFrame with the provided weights. The weigths are a list of doubles as weights with which to split the DataFrame. Weights will be normalized if they don’t sum up to 1.0. It's important to set a seed for reproducibility.

In [10]:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

file_path = 'train.csv'
df = spark.read.csv(file_path, header="true", inferSchema=True) 

In [12]:

train, test = df.randomSplit(weights=[0.75,0.25],seed=27)

In [13]:

from pyspark.ml.feature import VectorAssembler, VectorIndexer

# Choosing predictors
features_cols = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                        'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# This concatenates all feature columns into a single feature vector in a new column "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=features_cols, outputCol="raw_features")
# This identifies categorical features and indexes them.
vectorIndexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=4)

In [14]:

from pyspark.ml.regression import DecisionTreeRegressor
# Takes the "features" column and learns to predict "SalePrice"
dt = DecisionTreeRegressor(labelCol="SalePrice")

In [15]:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
        vectorAssembler, 
        vectorIndexer, 
        dt
    ])

In [17]:

# Creating a model with the training data
model = pipeline.fit(train)

In [18]:

# Making predictions with the test data
predictions = model.transform(test)

Let's see some of the predictions just to be sure we did a good job above

In [19]:

predictions.select("SalePrice","prediction").show(2)

+---------+----------+
|SalePrice|prediction|
+---------+----------+
|   181500|  167027.0|
|   250000|  290103.0|
+---------+----------+
only showing top 2 rows

Now for the evaluation

In [20]:

from pyspark.ml.evaluation import RegressionEvaluator

# Select (prediction, true label) and compute test error with MAE
evaluator = RegressionEvaluator(
    labelCol="SalePrice", predictionCol="prediction", metricName="mae")

In [23]:

mae = evaluator.evaluate(predictions)
print("Mean Absolute Error (MAE) on test data= {}".format(mae))

Mean Absolute Error (MAE) on test data= 29025.42375836274

So now we see that using this method we got a better result than with sklearn. So this is a good start for Spark :).

Underfitting, Overfitting and Model Optimization¶

Experimenting With Different Models¶

Now that you have a trustworthy way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from page 2 that a trees depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses and a leaf). As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 210 groups of houses by the time we get to the 10th level. That's 1024 leaves.

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting.

Example¶

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [1]:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return(mae)

Let's load the data again and split it:

In [2]:

import pandas as pd

file_path = 'train.csv'
df = pd.read_csv(file_path) 

y = df.SalePrice

predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                        'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

X = df[predictors]

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

In [3]:

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  35190
Max leaf nodes: 50  		 Mean Absolute Error:  27825
Max leaf nodes: 500  		 Mean Absolute Error:  32662
Max leaf nodes: 5000  		 Mean Absolute Error:  33382

We can see that in this case 50 is the optimal number of leaves (it has the lowest MAE).

Conclusion¶

Here's the takeaway: Models can suffer from either:

Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards.

Underfitting, Overfitting and Model Optimization in Spark¶

We will use the maxDept parameter to optimize because there's not a max_leaf_nodes in Spark.

Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.

In [5]:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

file_path = 'train.csv'
df = spark.read.csv(file_path, header="true", inferSchema=True) 

In [6]:

train, test = df.randomSplit(weights=[0.75,0.25],seed=27)

In [18]:

from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator


def get_mae_spark(max_depth, train_data, test_data):
    # This concatenates all feature columns into a single feature vector in a new column "rawFeatures".
    vectorAssembler = VectorAssembler(inputCols=features_cols, outputCol="raw_features")
    # This identifies categorical features and indexes them.
    vectorIndexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=4)
    # Takes the "features" column and learns to predict "SalePrice"
    dt = DecisionTreeRegressor(labelCol="SalePrice", maxDepth=max_depth)
    
    pipeline = Pipeline(stages=[
        vectorAssembler, 
        vectorIndexer, 
        dt
    ])
    
    # Creating a model with the training data
    model = pipeline.fit(train_data)
    # Making predictions with the test data
    predictions = model.transform(test_data)
    # Select (prediction, true label) and compute test error with MAE
    evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction", metricName="mae")
    mae = evaluator.evaluate(predictions)
    
    return mae

In [20]:

# compare MAE with differing values of max_leaf_nodes
for max_depth in [5, 10, 15, 20]:
    my_mae = get_mae_spark(max_depth, train, test)
    print("Max depth: {}  \t\t Mean Absolute Error:  {}".format(max_depth, my_mae))

Max depth: 5  		 Mean Absolute Error:  29025.42375836274
Max depth: 10  		 Mean Absolute Error:  28375.668652125525
Max depth: 15  		 Mean Absolute Error:  30544.50034435261
Max depth: 20  		 Mean Absolute Error:  30883.51698806244

Here the best maxDepth is 10 :)