The first thing you'll want to do is familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool that modern data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the command
import pandas as pd
The most important part of the pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table, similar to a sheet in Excel or a table in a SQL database. The pandas DataFrame has powerful methods for most things you'll want to do with this type of data. Let's start by looking at a basic overview of the Iowa housing data you'll be working with.
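To make the idea concrete, here is a tiny hand-built DataFrame. The values are made up for illustration, echoing two of the housing columns you'll see below:

```python
import pandas as pd

# A DataFrame is a table: named columns, one row per record.
homes = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

print(homes.shape)                 # (3, 2): three rows, two columns
print(homes["SalePrice"].mean())   # 204500.0
```

Each column behaves like a labeled array, so summary statistics such as the mean are a single method call away.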
We load and explore the data with the following:
# save filepath to variable for easier access
file_path = 'train.csv'
# read the data and store it in a DataFrame called df
df = pd.read_csv(file_path)
# show the data
df
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |
10 | 11 | 20 | RL | 70.0 | 11200 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 129500 |
11 | 12 | 60 | RL | 85.0 | 11924 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | New | Partial | 345000 |
12 | 13 | 20 | RL | NaN | 12968 | Pave | NaN | IR2 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 144000 |
13 | 14 | 20 | RL | 91.0 | 10652 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | New | Partial | 279500 |
14 | 15 | 20 | RL | NaN | 10920 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | GdWo | NaN | 0 | 5 | 2008 | WD | Normal | 157000 |
15 | 16 | 45 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | NaN | 0 | 7 | 2007 | WD | Normal | 132000 |
16 | 17 | 20 | RL | NaN | 11241 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 700 | 3 | 2010 | WD | Normal | 149000 |
17 | 18 | 90 | RL | 72.0 | 10791 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 500 | 10 | 2006 | WD | Normal | 90000 |
18 | 19 | 20 | RL | 66.0 | 13695 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 159000 |
19 | 20 | 20 | RL | 70.0 | 7560 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 5 | 2009 | COD | Abnorml | 139000 |
20 | 21 | 60 | RL | 101.0 | 14215 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2006 | New | Partial | 325300 |
21 | 22 | 45 | RM | 57.0 | 7449 | Pave | Grvl | Reg | Bnk | AllPub | ... | 0 | NaN | GdPrv | NaN | 0 | 6 | 2007 | WD | Normal | 139400 |
22 | 23 | 20 | RL | 75.0 | 9742 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 230000 |
23 | 24 | 120 | RM | 44.0 | 4224 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2007 | WD | Normal | 129900 |
24 | 25 | 20 | RL | NaN | 8246 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 5 | 2010 | WD | Normal | 154000 |
25 | 26 | 20 | RL | 110.0 | 14230 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2009 | WD | Normal | 256300 |
26 | 27 | 20 | RL | 60.0 | 7200 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 134800 |
27 | 28 | 20 | RL | 98.0 | 11478 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 306000 |
28 | 29 | 20 | RL | 47.0 | 16321 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2006 | WD | Normal | 207500 |
29 | 30 | 30 | RM | 60.0 | 6324 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2008 | WD | Normal | 68500 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1430 | 1431 | 60 | RL | 60.0 | 21930 | Pave | NaN | IR3 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | WD | Normal | 192140 |
1431 | 1432 | 120 | RL | NaN | 4928 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 10 | 2009 | WD | Normal | 143750 |
1432 | 1433 | 30 | RL | 60.0 | 10800 | Pave | Grvl | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 64500 |
1433 | 1434 | 60 | RL | 93.0 | 10261 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2008 | WD | Normal | 186500 |
1434 | 1435 | 20 | RL | 80.0 | 17400 | Pave | NaN | Reg | Low | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2006 | WD | Normal | 160000 |
1435 | 1436 | 20 | RL | 80.0 | 8400 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | NaN | 0 | 7 | 2008 | COD | Abnorml | 174000 |
1436 | 1437 | 20 | RL | 60.0 | 9000 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdWo | NaN | 0 | 5 | 2007 | WD | Normal | 120500 |
1437 | 1438 | 20 | RL | 96.0 | 12444 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2008 | New | Partial | 394617 |
1438 | 1439 | 20 | RM | 90.0 | 7407 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 4 | 2010 | WD | Normal | 149700 |
1439 | 1440 | 60 | RL | 80.0 | 11584 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2007 | WD | Normal | 197000 |
1440 | 1441 | 70 | RL | 79.0 | 11526 | Pave | NaN | IR1 | Bnk | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 191000 |
1441 | 1442 | 120 | RM | NaN | 4426 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2008 | WD | Normal | 149300 |
1442 | 1443 | 60 | FV | 85.0 | 11003 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | Normal | 310000 |
1443 | 1444 | 30 | RL | NaN | 8854 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | WD | Normal | 121000 |
1444 | 1445 | 20 | RL | 63.0 | 8500 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2007 | WD | Normal | 179600 |
1445 | 1446 | 85 | RL | 70.0 | 8400 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 129000 |
1446 | 1447 | 20 | RL | NaN | 26142 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 157900 |
1447 | 1448 | 60 | RL | 80.0 | 10000 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2007 | WD | Normal | 240000 |
1448 | 1449 | 50 | RL | 70.0 | 11767 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdWo | NaN | 0 | 5 | 2007 | WD | Normal | 112000 |
1449 | 1450 | 180 | RM | 21.0 | 1533 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2006 | WD | Abnorml | 92000 |
1450 | 1451 | 90 | RL | 60.0 | 9000 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2009 | WD | Normal | 136000 |
1451 | 1452 | 20 | RL | 78.0 | 9262 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | New | Partial | 287090 |
1452 | 1453 | 180 | RM | 35.0 | 3675 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2006 | WD | Normal | 145000 |
1453 | 1454 | 20 | RL | 90.0 | 17217 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | WD | Abnorml | 84500 |
1454 | 1455 | 20 | FV | 62.0 | 7500 | Pave | Pave | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 10 | 2009 | WD | Normal | 185000 |
1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
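With 1,460 rows, printing the whole DataFrame is rarely practical. A few standard attributes and methods give a faster orientation; the sketch below uses a small stand-in DataFrame (`sample_df`, with hypothetical values) rather than train.csv so it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the real training data.
sample_df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

print(sample_df.shape)             # (5, 3): (rows, columns)
print(sample_df.columns.tolist())  # ['Id', 'LotArea', 'SalePrice']
print(sample_df.head(3))           # first three rows, like the preview above
```

On the real data, `df.shape`, `df.columns`, and `df.head()` work the same way.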
# print a summary of the numeric columns in the training data
print(df.describe())
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000
mean    730.500000    56.897260    70.049958   10516.828082     6.099315
std     421.610009    42.300571    24.284752    9981.264932     1.382997
min       1.000000    20.000000    21.000000    1300.000000     1.000000
25%     365.750000    20.000000    59.000000    7553.500000     5.000000
50%     730.500000    50.000000    69.000000    9478.500000     6.000000
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000
max    1460.000000   190.000000   313.000000  215245.000000    10.000000

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726
std       1.112799    30.202904     20.645407   181.066207   456.098091
min       1.000000  1872.000000   1950.000000     0.000000     0.000000
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000

       ...   WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  \
count  ...  1460.000000  1460.000000    1460.000000  1460.000000
mean   ...    94.244521    46.660274      21.954110     3.409589
std    ...   125.338794    66.256028      61.119149    29.317331
min    ...     0.000000     0.000000       0.000000     0.000000
25%    ...     0.000000     0.000000       0.000000     0.000000
50%    ...     0.000000    25.000000       0.000000     0.000000
75%    ...   168.000000    68.000000       0.000000     0.000000
max    ...   857.000000   547.000000     552.000000   508.000000

       ScreenPorch     PoolArea       MiscVal       MoSold       YrSold  \
count  1460.000000  1460.000000   1460.000000  1460.000000  1460.000000
mean     15.060959     2.758904     43.489041     6.321918  2007.815753
std      55.757415    40.177307    496.123024     2.703626     1.328095
min       0.000000     0.000000      0.000000     1.000000  2006.000000
25%       0.000000     0.000000      0.000000     5.000000  2007.000000
50%       0.000000     0.000000      0.000000     6.000000  2008.000000
75%       0.000000     0.000000      0.000000     8.000000  2009.000000
max     480.000000   738.000000  15500.000000    12.000000  2010.000000

           SalePrice
count    1460.000000
mean   180921.195890
std     79442.502883
min     34900.000000
25%    129975.000000
50%    163000.000000
75%    214000.000000
max    755000.000000

[8 rows x 38 columns]
The results show 8 numbers for each numeric column in your original dataset. The first number, the count, shows how many rows have non-missing values.
The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.
To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the sorted list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
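You can verify these definitions on a small toy Series (hypothetical numbers, not the housing data). Note that for small samples pandas interpolates between neighbouring values, so a percentile need not be an element of the data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

print(s.min(), s.max())    # 1 8
print(s.quantile(0.25))    # 2.75 -> the "25%" row of describe()
print(s.quantile(0.50))    # 4.5  -> the median
print(s.describe())        # count, mean, std, min, 25%, 50%, 75%, max
```

Here 2.75 is a quarter of the way along the sorted values: bigger than the smallest 25% of them, smaller than the rest.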
Optimus (by Iron) is the missing framework for cleaning, pre-processing, and exploring data in a distributed fashion. It uses all the power of Apache Spark (optimized via Catalyst) and Python to do so, and it implements several handy tools for data wrangling and munging that will make your life much easier. The first obvious advantage over other public data cleaning libraries and frameworks is that it works on your laptop or on your big cluster alike; the second is that it is amazingly easy to install, use, and understand.
!pip install optimuspyspark
Requirement already satisfied: optimuspyspark in /Users/faviovazquez/anaconda/lib/python3.6/site-packages
Requirement already satisfied: pytest in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: pixiedust-optimus in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: spark-df-profiling-optimus in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
Requirement already satisfied: pyspark in /Users/faviovazquez/anaconda/lib/python3.6/site-packages (from optimuspyspark)
... (remaining "Requirement already satisfied" lines for transitive dependencies omitted)
Using Optimus is really easy, and you have Spark underneath :)
import optimus as op
tools = op.Utilities()
df = tools.read_csv("train.csv")
df.show()
+---+----------+--------+-----------+-------+------+-----+--------+---
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|...
+---+----------+--------+-----------+-------+------+-----+--------+---
|  1|        60|      RL|         65|   8450|  Pave|   NA|     Reg|...
|  2|        20|      RL|         80|   9600|  Pave|   NA|     Reg|...
|  3|        60|      RL|         68|  11250|  Pave|   NA|     IR1|...
|  4|        70|      RL|         60|   9550|  Pave|   NA|     IR1|...
|  5|        60|      RL|         84|  14260|  Pave|   NA|     IR1|...
+---+----------+--------+-----------+-------+------+-----+--------+---
only showing top 20 rows (all 81 columns appear in the original output; truncated here for readability)
profiler = op.DataFrameProfiler(df)
profiler.profiler()
Dataset info
Number of variables | 81 |
---|---|
Number of observations | 1460 |
Total Missing (%) | 0.0% |
Total size in memory | 0.0 B |
Average record size in memory | 0.0 B |
Variables types
Numeric | 35 |
---|---|
Categorical | 46 |
Date | 0 |
Text (Unique) | 0 |
Rejected | 0 |
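A similar numeric-versus-categorical breakdown can be sketched in plain pandas with select_dtypes; the toy DataFrame below (hypothetical values) mixes numeric and string columns the way the housing data does:

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
    "MSZoning": ["RL", "RL", "RM"],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Split columns by dtype: numbers vs. strings (object dtype).
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(include="object").columns

print(len(numeric_cols), list(numeric_cols))          # 2 ['LotArea', 'SalePrice']
print(len(categorical_cols), list(categorical_cols))  # 2 ['MSZoning', 'Street']
```

Run against the full training data, a count like this is how the profiler arrives at its 35 numeric and 46 categorical variables.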
Warnings
2ndFlrSF has 829 / 56.8% zeros
3SsnPorch has 1436 / 98.4% zeros
BsmtFinSF1 has 467 / 32.0% zeros
BsmtFinSF2 has 1293 / 88.6% zeros
BsmtFullBath has 856 / 58.6% zeros
BsmtHalfBath has 1378 / 94.4% zeros
BsmtUnfSF has 118 / 8.1% zeros
EnclosedPorch has 1252 / 85.8% zeros
Fireplaces has 690 / 47.3% zeros
GarageArea has 81 / 5.5% zeros
GarageCars has 81 / 5.5% zeros
GarageYrBlt has a high cardinality: 98 distinct values
HalfBath has 913 / 62.5% zeros
LotFrontage has a high cardinality: 111 distinct values
LowQualFinSF has 1434 / 98.2% zeros
MasVnrArea has a high cardinality: 328 distinct values
MiscVal has 1408 / 96.4% zeros
MiscVal is highly skewed (γ1 = 24.452)
OpenPorchSF has 656 / 44.9% zeros
PoolArea has 1453 / 99.5% zeros
ScreenPorch has 1344 / 92.1% zeros
TotalBsmtSF has 37 / 2.5% zeros
WoodDeckSF has 761 / 52.1% zeros
1stFlrSF
Numeric
Distinct count | 753 |
---|---|
Unique (%) | 51.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1162.6 |
---|---|
Minimum | 334 |
Maximum | 4692 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 334 |
---|---|
5-th percentile | 672.95 |
Q1 | 882 |
Median | 1087 |
Q3 | 1391.2 |
95-th percentile | 1831.2 |
Maximum | 4692 |
Range | 4358 |
Interquartile range | 509.25 |
Descriptive statistics
Standard deviation | 386.59 |
---|---|
Coef of variation | 0.33251 |
Kurtosis | 5.7221 |
Mean | 1162.6 |
MAD | 300.58 |
Skewness | 1.3753 |
Sum | 1697400 |
Variance | 149450 |
Memory size | 0.0 B |
2ndFlrSF
Numeric
Distinct count | 417 |
---|---|
Unique (%) | 28.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 346.99 |
---|---|
Minimum | 0 |
Maximum | 2065 |
Zeros (%) | 56.8% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 728 |
95-th percentile | 1141 |
Maximum | 2065 |
Range | 2065 |
Interquartile range | 728 |
Descriptive statistics
Standard deviation | 436.53 |
---|---|
Coef of variation | 1.258 |
Kurtosis | -0.55568 |
Mean | 346.99 |
MAD | 396.48 |
Skewness | 0.81219 |
Sum | 506610 |
Variance | 190560 |
Memory size | 0.0 B |
3SsnPorch
Numeric
Distinct count | 20 |
---|---|
Unique (%) | 1.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 3.4096 |
---|---|
Minimum | 0 |
Maximum | 508 |
Zeros (%) | 98.4% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 0 |
Maximum | 508 |
Range | 508 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 29.317 |
---|---|
Coef of variation | 8.5985 |
Kurtosis | 123.24 |
Mean | 3.4096 |
MAD | 6.7071 |
Skewness | 10.294 |
Sum | 4978 |
Variance | 859.51 |
Memory size | 0.0 B |
Alley
Categorical
Distinct count | 3 |
---|---|
Unique (%) | 0.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
NA | 1369 |
---|---|
Grvl | 50 |
Pave | 41 |
Value | Count | Frequency (%) | |
NA | 1369 | 93.8% | |
Grvl | 50 | 3.4% | |
Pave | 41 | 2.8% |
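The same frequency table is easy to reproduce in plain pandas with value_counts. The Series below is a hypothetical reconstruction of the Alley column from the profiler's counts, not the raw file (pandas' read_csv would normally parse "NA" as a missing value, in which case you'd pass dropna=False):

```python
import pandas as pd

# Reconstructed Alley column: 1369 "NA", 50 "Grvl", 41 "Pave".
alley = pd.Series(["NA"] * 1369 + ["Grvl"] * 50 + ["Pave"] * 41)

counts = alley.value_counts()                    # absolute counts
freqs = alley.value_counts(normalize=True) * 100 # frequencies in percent

print(counts.to_dict())          # {'NA': 1369, 'Grvl': 50, 'Pave': 41}
print(freqs.round(1).to_dict())  # {'NA': 93.8, 'Grvl': 3.4, 'Pave': 2.8}
```

This matches the Value / Count / Frequency (%) table above.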
BedroomAbvGr
Numeric
Distinct count | 8 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 2.8664 |
---|---|
Minimum | 0 |
Maximum | 8 |
Zeros (%) | 0.4% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 2 |
Q1 | 2 |
Median | 3 |
Q3 | 3 |
95-th percentile | 4 |
Maximum | 8 |
Range | 8 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 0.81578 |
---|---|
Coef of variation | 0.2846 |
Kurtosis | 2.2191 |
Mean | 2.8664 |
MAD | 0.57631 |
Skewness | 0.21157 |
Sum | 4185 |
Variance | 0.66549 |
BldgType
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
1Fam | 1220 | 83.6% | |
TwnhsE | 114 | 7.8% | |
Duplex | 52 | 3.6% | |
Twnhs | 43 | 2.9% | |
2fmCon | 31 | 2.1% |
BsmtCond
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 1311 | 89.8% | |
Gd | 65 | 4.5% | |
Fa | 45 | 3.1% | |
NA | 37 | 2.5% | |
Po | 2 | 0.1% |
BsmtExposure
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
No | 953 | 65.3% | |
Av | 221 | 15.1% | |
Gd | 134 | 9.2% | |
Mn | 114 | 7.8% | |
NA | 38 | 2.6% |
BsmtFinSF1
Numeric
Distinct count | 637 |
---|---|
Unique (%) | 43.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 443.64 |
---|---|
Minimum | 0 |
Maximum | 5644 |
Zeros (%) | 32.0% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 383.5 |
Q3 | 712.25 |
95-th percentile | 1274 |
Maximum | 5644 |
Range | 5644 |
Interquartile range | 712.25 |
Descriptive statistics
Standard deviation | 456.1 |
---|---|
Coef of variation | 1.0281 |
Kurtosis | 11.076 |
Mean | 443.64 |
MAD | 367.37 |
Skewness | 1.6838 |
Sum | 647710 |
Variance | 208030 |
BsmtFinSF2
Numeric
Distinct count | 144 |
---|---|
Unique (%) | 9.9% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 46.549 |
---|---|
Minimum | 0 |
Maximum | 1474 |
Zeros (%) | 88.6% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 396.2 |
Maximum | 1474 |
Range | 1474 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 161.32 |
---|---|
Coef of variation | 3.4656 |
Kurtosis | 20.04 |
Mean | 46.549 |
MAD | 82.535 |
Skewness | 4.2509 |
Sum | 67962 |
Variance | 26024 |
BsmtFinType1
Categorical
Distinct count | 7 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Unf | 430 | 29.5% | |
GLQ | 418 | 28.6% | |
ALQ | 220 | 15.1% | |
BLQ | 148 | 10.1% | |
Rec | 133 | 9.1% | |
LwQ | 74 | 5.1% | |
NA | 37 | 2.5% |
BsmtFinType2
Categorical
Distinct count | 7 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Unf | 1256 | 86.0% | |
Rec | 54 | 3.7% | |
LwQ | 46 | 3.2% | |
NA | 38 | 2.6% | |
BLQ | 33 | 2.3% | |
ALQ | 19 | 1.3% | |
GLQ | 14 | 1.0% |
BsmtFullBath
Numeric
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.42534 |
---|---|
Minimum | 0 |
Maximum | 3 |
Zeros (%) | 58.6% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 1 |
95-th percentile | 1 |
Maximum | 3 |
Range | 3 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 0.51891 |
---|---|
Coef of variation | 1.22 |
Kurtosis | -0.84033 |
Mean | 0.42534 |
MAD | 0.49876 |
Skewness | 0.59545 |
Sum | 621 |
Variance | 0.26927 |
BsmtHalfBath
Numeric
Distinct count | 3 |
---|---|
Unique (%) | 0.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.057534 |
---|---|
Minimum | 0 |
Maximum | 2 |
Zeros (%) | 94.4% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 1 |
Maximum | 2 |
Range | 2 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 0.23875 |
---|---|
Coef of variation | 4.1497 |
Kurtosis | 16.336 |
Mean | 0.057534 |
MAD | 0.10861 |
Skewness | 4.0992 |
Sum | 84 |
Variance | 0.057003 |
BsmtQual
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 649 | 44.5% | |
Gd | 618 | 42.3% | |
Ex | 121 | 8.3% | |
NA | 37 | 2.5% | |
Fa | 35 | 2.4% |
BsmtUnfSF
Numeric
Distinct count | 780 |
---|---|
Unique (%) | 53.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 567.24 |
---|---|
Minimum | 0 |
Maximum | 2336 |
Zeros (%) | 8.1% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 223 |
Median | 477.5 |
Q3 | 808 |
95-th percentile | 1468 |
Maximum | 2336 |
Range | 2336 |
Interquartile range | 585 |
Descriptive statistics
Standard deviation | 441.87 |
---|---|
Coef of variation | 0.77898 |
Kurtosis | 0.46926 |
Mean | 567.24 |
MAD | 353.28 |
Skewness | 0.91932 |
Sum | 828170 |
Variance | 195250 |
CentralAir
Categorical
Distinct count | 2 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Y | 1365 | 93.5% | |
N | 95 | 6.5% |
Condition1
Categorical
Distinct count | 9 |
---|---|
Unique (%) | 0.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Norm | 1260 | 86.3% | |
Feedr | 81 | 5.5% | |
Artery | 48 | 3.3% | |
RRAn | 26 | 1.8% | |
PosN | 19 | 1.3% | |
RRAe | 11 | 0.8% | |
PosA | 8 | 0.5% | |
RRNn | 5 | 0.3% | |
RRNe | 2 | 0.1% |
Condition2
Categorical
Distinct count | 8 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Norm | 1445 | 99.0% | |
Feedr | 6 | 0.4% | |
PosN | 2 | 0.1% | |
Artery | 2 | 0.1% | |
RRNn | 2 | 0.1% | |
RRAn | 1 | 0.1% | |
PosA | 1 | 0.1% | |
RRAe | 1 | 0.1% |
Electrical
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
SBrkr | 1334 | 91.4% | |
FuseA | 94 | 6.4% | |
FuseF | 27 | 1.8% | |
FuseP | 3 | 0.2% | |
Mix | 1 | 0.1% | |
NA | 1 | 0.1% |
EnclosedPorch
Numeric
Distinct count | 120 |
---|---|
Unique (%) | 8.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 21.954 |
---|---|
Minimum | 0 |
Maximum | 552 |
Zeros (%) | 85.8% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 180.15 |
Maximum | 552 |
Range | 552 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 61.119 |
---|---|
Coef of variation | 2.784 |
Kurtosis | 10.391 |
Mean | 21.954 |
MAD | 37.66 |
Skewness | 3.0867 |
Sum | 32053 |
Variance | 3735.6 |
ExterCond
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 1282 | 87.8% | |
Gd | 146 | 10.0% | |
Fa | 28 | 1.9% | |
Ex | 3 | 0.2% | |
Po | 1 | 0.1% |
ExterQual
Categorical
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 906 | 62.1% | |
Gd | 488 | 33.4% | |
Ex | 52 | 3.6% | |
Fa | 14 | 1.0% |
Exterior1st
Categorical
Distinct count | 15 |
---|---|
Unique (%) | 1.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
VinylSd | 515 | 35.3% | |
HdBoard | 222 | 15.2% | |
MetalSd | 220 | 15.1% | |
Wd Sdng | 206 | 14.1% | |
Plywood | 108 | 7.4% | |
CemntBd | 61 | 4.2% | |
BrkFace | 50 | 3.4% | |
WdShing | 26 | 1.8% | |
Stucco | 25 | 1.7% | |
AsbShng | 20 | 1.4% | |
Stone | 2 | 0.1% | |
BrkComm | 2 | 0.1% | |
AsphShn | 1 | 0.1% | |
ImStucc | 1 | 0.1% | |
CBlock | 1 | 0.1% |
Exterior2nd
Categorical
Distinct count | 16 |
---|---|
Unique (%) | 1.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
VinylSd | 504 | 34.5% | |
MetalSd | 214 | 14.7% | |
HdBoard | 207 | 14.2% | |
Wd Sdng | 197 | 13.5% | |
Plywood | 142 | 9.7% | |
CmentBd | 60 | 4.1% | |
Wd Shng | 38 | 2.6% | |
Stucco | 26 | 1.8% | |
BrkFace | 25 | 1.7% | |
AsbShng | 20 | 1.4% | |
ImStucc | 10 | 0.7% | |
Brk Cmn | 7 | 0.5% | |
Stone | 5 | 0.3% | |
AsphShn | 3 | 0.2% | |
Other | 1 | 0.1% | |
CBlock | 1 | 0.1% |
Fence
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NA | 1179 | 80.8% | |
MnPrv | 157 | 10.8% | |
GdPrv | 59 | 4.0% | |
GdWo | 54 | 3.7% | |
MnWw | 11 | 0.8% |
FireplaceQu
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NA | 690 | 47.3% | |
Gd | 380 | 26.0% | |
TA | 313 | 21.4% | |
Fa | 33 | 2.3% | |
Ex | 24 | 1.6% | |
Po | 20 | 1.4% |
Fireplaces
Numeric
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.61301 |
---|---|
Minimum | 0 |
Maximum | 3 |
Zeros (%) | 47.3% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 1 |
Q3 | 1 |
95-th percentile | 2 |
Maximum | 3 |
Range | 3 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 0.64467 |
---|---|
Coef of variation | 1.0516 |
Kurtosis | -0.2206 |
Mean | 0.61301 |
MAD | 0.57942 |
Skewness | 0.6489 |
Sum | 895 |
Variance | 0.41559 |
Foundation
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
PConc | 647 | 44.3% | |
CBlock | 634 | 43.4% | |
BrkTil | 146 | 10.0% | |
Slab | 24 | 1.6% | |
Stone | 6 | 0.4% | |
Wood | 3 | 0.2% |
FullBath
Numeric
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1.5651 |
---|---|
Minimum | 0 |
Maximum | 3 |
Zeros (%) | 0.6% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1 |
Q1 | 1 |
Median | 2 |
Q3 | 2 |
95-th percentile | 2 |
Maximum | 3 |
Range | 3 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 0.55092 |
---|---|
Coef of variation | 0.35201 |
Kurtosis | -0.85822 |
Mean | 1.5651 |
MAD | 0.52244 |
Skewness | 0.036524 |
Sum | 2285 |
Variance | 0.30351 |
Functional
Categorical
Distinct count | 7 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Typ | 1360 | 93.2% | |
Min2 | 34 | 2.3% | |
Min1 | 31 | 2.1% | |
Mod | 15 | 1.0% | |
Maj1 | 14 | 1.0% | |
Maj2 | 5 | 0.3% | |
Sev | 1 | 0.1% |
GarageArea
Numeric
Distinct count | 441 |
---|---|
Unique (%) | 30.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 472.98 |
---|---|
Minimum | 0 |
Maximum | 1418 |
Zeros (%) | 5.5% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 334.5 |
Median | 480 |
Q3 | 576 |
95-th percentile | 850.1 |
Maximum | 1418 |
Range | 1418 |
Interquartile range | 241.5 |
Descriptive statistics
Standard deviation | 213.8 |
---|---|
Coef of variation | 0.45204 |
Kurtosis | 0.90982 |
Mean | 472.98 |
MAD | 160.02 |
Skewness | 0.1798 |
Sum | 690550 |
Variance | 45713 |
GarageCars
Numeric
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1.7671 |
---|---|
Minimum | 0 |
Maximum | 4 |
Zeros (%) | 5.5% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 1 |
Median | 2 |
Q3 | 2 |
95-th percentile | 3 |
Maximum | 4 |
Range | 4 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 0.74732 |
---|---|
Coef of variation | 0.4229 |
Kurtosis | 0.21613 |
Mean | 1.7671 |
MAD | 0.58384 |
Skewness | -0.3422 |
Sum | 2580 |
Variance | 0.55848 |
GarageCond
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 1326 | 90.8% | |
NA | 81 | 5.5% | |
Fa | 35 | 2.4% | |
Gd | 9 | 0.6% | |
Po | 7 | 0.5% | |
Ex | 2 | 0.1% |
GarageFinish
Categorical
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Unf | 605 | 41.4% | |
RFn | 422 | 28.9% | |
Fin | 352 | 24.1% | |
NA | 81 | 5.5% |
GarageQual
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 1311 | 89.8% | |
NA | 81 | 5.5% | |
Fa | 48 | 3.3% | |
Gd | 14 | 1.0% | |
Po | 3 | 0.2% | |
Ex | 3 | 0.2% |
GarageType
Categorical
Distinct count | 7 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Attchd | 870 | 59.6% | |
Detchd | 387 | 26.5% | |
BuiltIn | 88 | 6.0% | |
NA | 81 | 5.5% | |
Basment | 19 | 1.3% | |
CarPort | 9 | 0.6% | |
2Types | 6 | 0.4% |
GarageYrBlt
Categorical
Distinct count | 98 |
---|---|
Unique (%) | 6.7% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NA | 81 | 5.5% | |
2005 | 65 | 4.5% | |
2006 | 59 | 4.0% | |
2004 | 53 | 3.6% | |
2003 | 50 | 3.4% | |
2007 | 49 | 3.4% | |
1977 | 35 | 2.4% | |
1998 | 31 | 2.1% | |
1999 | 30 | 2.1% | |
1976 | 29 | 2.0% | |
2008 | 29 | 2.0% | |
2000 | 27 | 1.8% | |
1968 | 26 | 1.8% | |
2002 | 26 | 1.8% | |
1950 | 24 | 1.6% | |
1993 | 22 | 1.5% | |
1958 | 21 | 1.4% | |
1965 | 21 | 1.4% | |
1962 | 21 | 1.4% | |
2009 | 21 | 1.4% | |
Other values (78) | 740 | 50.7% |
GrLivArea
Numeric
Distinct count | 861 |
---|---|
Unique (%) | 59.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1515.5 |
---|---|
Minimum | 334 |
Maximum | 5642 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 334 |
---|---|
5-th percentile | 848 |
Q1 | 1129.5 |
Median | 1464 |
Q3 | 1776.8 |
95-th percentile | 2466.1 |
Maximum | 5642 |
Range | 5308 |
Interquartile range | 647.25 |
Descriptive statistics
Standard deviation | 525.48 |
---|---|
Coef of variation | 0.34675 |
Kurtosis | 4.8743 |
Mean | 1515.5 |
MAD | 397.32 |
Skewness | 1.3652 |
Sum | 2212600 |
Variance | 276130 |
HalfBath
Numeric
Distinct count | 3 |
---|---|
Unique (%) | 0.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 0.38288 |
---|---|
Minimum | 0 |
Maximum | 2 |
Zeros (%) | 62.5% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 1 |
95-th percentile | 1 |
Maximum | 2 |
Range | 2 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 0.50289 |
---|---|
Coef of variation | 1.3134 |
Kurtosis | -1.0773 |
Mean | 0.38288 |
MAD | 0.47886 |
Skewness | 0.6752 |
Sum | 559 |
Variance | 0.25289 |
Heating
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
GasA | 1428 | 97.8% | |
GasW | 18 | 1.2% | |
Grav | 7 | 0.5% | |
Wall | 4 | 0.3% | |
OthW | 2 | 0.1% | |
Floor | 1 | 0.1% |
HeatingQC
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Ex | 741 | 50.8% | |
TA | 428 | 29.3% | |
Gd | 241 | 16.5% | |
Fa | 49 | 3.4% | |
Po | 1 | 0.1% |
HouseStyle
Categorical
Distinct count | 8 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
1Story | 726 | 49.7% | |
2Story | 445 | 30.5% | |
1.5Fin | 154 | 10.5% | |
SLvl | 65 | 4.5% | |
SFoyer | 37 | 2.5% | |
1.5Unf | 14 | 1.0% | |
2.5Unf | 11 | 0.8% | |
2.5Fin | 8 | 0.5% |
Id
Numeric
Distinct count | 1460 |
---|---|
Unique (%) | 100.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 730.5 |
---|---|
Minimum | 1 |
Maximum | 1460 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 73.95 |
Q1 | 365.75 |
Median | 730.5 |
Q3 | 1095.2 |
95-th percentile | 1387 |
Maximum | 1460 |
Range | 1459 |
Interquartile range | 729.5 |
Descriptive statistics
Standard deviation | 421.61 |
---|---|
Coef of variation | 0.57715 |
Kurtosis | -1.2 |
Mean | 730.5 |
MAD | 365 |
Skewness | 0 |
Sum | 1066500 |
Variance | 177760 |
KitchenAbvGr
Numeric
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1.0466 |
---|---|
Minimum | 0 |
Maximum | 3 |
Zeros (%) | 0.1% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1 |
Q1 | 1 |
Median | 1 |
Q3 | 1 |
95-th percentile | 1 |
Maximum | 3 |
Range | 3 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 0.22034 |
---|---|
Coef of variation | 0.21053 |
Kurtosis | 21.455 |
Mean | 1.0466 |
MAD | 0.090246 |
Skewness | 4.4838 |
Sum | 1528 |
Variance | 0.048549 |
KitchenQual
Categorical
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
TA | 735 | 50.3% | |
Gd | 586 | 40.1% | |
Ex | 100 | 6.8% | |
Fa | 39 | 2.7% |
LandContour
Categorical
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Lvl | 1311 | 89.8% | |
Bnk | 63 | 4.3% | |
HLS | 50 | 3.4% | |
Low | 36 | 2.5% |
LandSlope
Categorical
Distinct count | 3 |
---|---|
Unique (%) | 0.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Gtl | 1382 | 94.7% | |
Mod | 65 | 4.5% | |
Sev | 13 | 0.9% |
LotArea
Numeric
Distinct count | 1073 |
---|---|
Unique (%) | 73.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 10517 |
---|---|
Minimum | 1300 |
Maximum | 215240 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1300 |
---|---|
5-th percentile | 3311.7 |
Q1 | 7553.5 |
Median | 9478.5 |
Q3 | 11602 |
95-th percentile | 17401 |
Maximum | 215240 |
Range | 213940 |
Interquartile range | 4048 |
Descriptive statistics
Standard deviation | 9981.3 |
---|---|
Coef of variation | 0.94908 |
Kurtosis | 202.54 |
Mean | 10517 |
MAD | 3758.8 |
Skewness | 12.195 |
Sum | 15355000 |
Variance | 99626000 |
LotConfig
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Inside | 1052 | 72.1% | |
Corner | 263 | 18.0% | |
CulDSac | 94 | 6.4% | |
FR2 | 47 | 3.2% | |
FR3 | 4 | 0.3% |
LotFrontage
Categorical
Distinct count | 111 |
---|---|
Unique (%) | 7.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NA | 259 | 17.7% | |
60 | 143 | 9.8% | |
70 | 70 | 4.8% | |
80 | 69 | 4.7% | |
50 | 57 | 3.9% | |
75 | 53 | 3.6% | |
65 | 44 | 3.0% | |
85 | 40 | 2.7% | |
78 | 25 | 1.7% | |
90 | 23 | 1.6% | |
21 | 23 | 1.6% | |
64 | 19 | 1.3% | |
68 | 19 | 1.3% | |
24 | 19 | 1.3% | |
73 | 18 | 1.2% | |
55 | 17 | 1.2% | |
79 | 17 | 1.2% | |
63 | 17 | 1.2% | |
72 | 17 | 1.2% | |
100 | 16 | 1.1% | |
Other values (91) | 495 | 33.9% |
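Note that LotFrontage is reported as Categorical even though its values are numbers. The likely cause is that the literal string `'NA'` kept the column as text. Coercing it back to numeric is a one-liner; the tiny series below is a hypothetical stand-in (with the real data you would apply this to `df['LotFrontage']`):

```python
import pandas as pd

# Hypothetical sample of a numeric column stored as strings with 'NA' gaps.
lot_frontage = pd.Series(['60', 'NA', '70', '80', 'NA'])

# errors='coerce' turns anything non-numeric ('NA' here) into real NaN.
numeric = pd.to_numeric(lot_frontage, errors='coerce')
print(numeric.isna().sum(), numeric.mean())
```

Alternatively, telling `pd.read_csv` which sentinel strings mean "missing" (via `na_values`) fixes this at load time. The same issue explains why MasVnrArea and GarageYrBlt also show up as Categorical in this report.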
LotShape
Categorical
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Reg | 925 | 63.4% | |
IR1 | 484 | 33.2% | |
IR2 | 41 | 2.8% | |
IR3 | 10 | 0.7% |
LowQualFinSF
Numeric
Distinct count | 24 |
---|---|
Unique (%) | 1.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 5.8445 |
---|---|
Minimum | 0 |
Maximum | 572 |
Zeros (%) | 98.2% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 0 |
Maximum | 572 |
Range | 572 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 48.623 |
---|---|
Coef of variation | 8.3194 |
Kurtosis | 82.946 |
Mean | 5.8445 |
MAD | 11.481 |
Skewness | 9.0021 |
Sum | 8533 |
Variance | 2364.2 |
MSSubClass
Numeric
Distinct count | 15 |
---|---|
Unique (%) | 1.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 56.897 |
---|---|
Minimum | 20 |
Maximum | 190 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 20 |
---|---|
5-th percentile | 20 |
Q1 | 20 |
Median | 50 |
Q3 | 70 |
95-th percentile | 160 |
Maximum | 190 |
Range | 170 |
Interquartile range | 50 |
Descriptive statistics
Standard deviation | 42.301 |
---|---|
Coef of variation | 0.74346 |
Kurtosis | 1.5707 |
Mean | 56.897 |
MAD | 31.283 |
Skewness | 1.4062 |
Sum | 83070 |
Variance | 1789.3 |
MSZoning
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
RL | 1151 | 78.8% | |
RM | 218 | 14.9% | |
FV | 65 | 4.5% | |
RH | 16 | 1.1% | |
C (all) | 10 | 0.7% |
MasVnrArea
Categorical
Distinct count | 328 |
---|---|
Unique (%) | 22.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
0 | 861 | 59.0% | |
108 | 8 | 0.5% | |
180 | 8 | 0.5% | |
72 | 8 | 0.5% | |
NA | 8 | 0.5% | |
16 | 7 | 0.5% | |
120 | 7 | 0.5% | |
200 | 6 | 0.4% | |
106 | 6 | 0.4% | |
340 | 6 | 0.4% | |
80 | 6 | 0.4% | |
170 | 5 | 0.3% | |
84 | 5 | 0.3% | |
320 | 5 | 0.3% | |
360 | 5 | 0.3% | |
132 | 5 | 0.3% | |
216 | 4 | 0.3% | |
196 | 4 | 0.3% | |
76 | 4 | 0.3% | |
210 | 4 | 0.3% | |
Other values (308) | 488 | 33.4% |
MasVnrType
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
None | 864 | 59.2% | |
BrkFace | 445 | 30.5% | |
Stone | 128 | 8.8% | |
BrkCmn | 15 | 1.0% | |
NA | 8 | 0.5% |
MiscFeature
Categorical
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NA | 1406 | 96.3% | |
Shed | 49 | 3.4% | |
Gar2 | 2 | 0.1% | |
Othr | 2 | 0.1% | |
TenC | 1 | 0.1% |
MiscVal
Numeric
Distinct count | 21 |
---|---|
Unique (%) | 1.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 43.489 |
---|---|
Minimum | 0 |
Maximum | 15500 |
Zeros (%) | 96.4% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 0 |
Maximum | 15500 |
Range | 15500 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 496.12 |
---|---|
Coef of variation | 11.408 |
Kurtosis | 698.6 |
Mean | 43.489 |
MAD | 83.88 |
Skewness | 24.452 |
Sum | 63494 |
Variance | 246140 |
MoSold
Numeric
Distinct count | 12 |
---|---|
Unique (%) | 0.8% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 6.3219 |
---|---|
Minimum | 1 |
Maximum | 12 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 2 |
Q1 | 5 |
Median | 6 |
Q3 | 8 |
95-th percentile | 11 |
Maximum | 12 |
Range | 11 |
Interquartile range | 3 |
Descriptive statistics
Standard deviation | 2.7036 |
---|---|
Coef of variation | 0.42766 |
Kurtosis | -0.40683 |
Mean | 6.3219 |
MAD | 2.1425 |
Skewness | 0.21184 |
Sum | 9230 |
Variance | 7.3096 |
Neighborhood
Categorical
Distinct count | 25 |
---|---|
Unique (%) | 1.7% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NAmes | 225 | 15.4% | |
CollgCr | 150 | 10.3% | |
OldTown | 113 | 7.7% | |
Edwards | 100 | 6.8% | |
Somerst | 86 | 5.9% | |
Gilbert | 79 | 5.4% | |
NridgHt | 77 | 5.3% | |
Sawyer | 74 | 5.1% | |
NWAmes | 73 | 5.0% | |
SawyerW | 59 | 4.0% | |
BrkSide | 58 | 4.0% | |
Crawfor | 51 | 3.5% | |
Mitchel | 49 | 3.4% | |
NoRidge | 41 | 2.8% | |
Timber | 38 | 2.6% | |
IDOTRR | 37 | 2.5% | |
ClearCr | 28 | 1.9% | |
StoneBr | 25 | 1.7% | |
SWISU | 25 | 1.7% | |
MeadowV | 17 | 1.2% | |
Other values (5) | 55 | 3.8% |
OpenPorchSF
Numeric
Distinct count | 202 |
---|---|
Unique (%) | 13.8% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 46.66 |
---|---|
Minimum | 0 |
Maximum | 547 |
Zeros (%) | 44.9% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 25 |
Q3 | 68 |
95-th percentile | 175.05 |
Maximum | 547 |
Range | 547 |
Interquartile range | 68 |
Descriptive statistics
Standard deviation | 66.256 |
---|---|
Coef of variation | 1.42 |
Kurtosis | 8.4572 |
Mean | 46.66 |
MAD | 47.678 |
Skewness | 2.3619 |
Sum | 68124 |
Variance | 4389.9 |
OverallCond
Numeric
Distinct count | 9 |
---|---|
Unique (%) | 0.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 5.5753 |
---|---|
Minimum | 1 |
Maximum | 9 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 4 |
Q1 | 5 |
Median | 5 |
Q3 | 6 |
95-th percentile | 8 |
Maximum | 9 |
Range | 8 |
Interquartile range | 1 |
Descriptive statistics
Standard deviation | 1.1128 |
---|---|
Coef of variation | 0.19959 |
Kurtosis | 1.0985 |
Mean | 5.5753 |
MAD | 0.88902 |
Skewness | 0.69236 |
Sum | 8140 |
Variance | 1.2383 |
OverallQual
Numeric
Distinct count | 10 |
---|---|
Unique (%) | 0.7% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 6.0993 |
---|---|
Minimum | 1 |
Maximum | 10 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 4 |
Q1 | 5 |
Median | 6 |
Q3 | 7 |
95-th percentile | 8 |
Maximum | 10 |
Range | 9 |
Interquartile range | 2 |
Descriptive statistics
Standard deviation | 1.383 |
---|---|
Coef of variation | 0.22675 |
Kurtosis | 0.091857 |
Mean | 6.0993 |
MAD | 1.098 |
Skewness | 0.21672 |
Sum | 8905 |
Variance | 1.9127 |
PavedDrive
Categorical
Distinct count | 3 |
---|---|
Unique (%) | 0.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Y | 1340 | 91.8% | |
N | 90 | 6.2% | |
P | 30 | 2.1% |
PoolArea
Numeric
Distinct count | 8 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 2.7589 |
---|---|
Minimum | 0 |
Maximum | 738 |
Zeros (%) | 99.5% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 0 |
Maximum | 738 |
Range | 738 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 40.177 |
---|---|
Coef of variation | 14.563 |
Kurtosis | 222.5 |
Mean | 2.7589 |
MAD | 5.4914 |
Skewness | 14.813 |
Sum | 4028 |
Variance | 1614.2 |
PoolQC
Categorical
Distinct count | 4 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
NA | 1453 | 99.5% | |
Gd | 3 | 0.2% | |
Ex | 2 | 0.1% | |
Fa | 2 | 0.1% |
RoofMatl
Categorical
Distinct count | 8 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
CompShg | 1434 | 98.2% | |
Tar&Grv | 11 | 0.8% | |
WdShngl | 6 | 0.4% | |
WdShake | 5 | 0.3% | |
Membran | 1 | 0.1% | |
ClyTile | 1 | 0.1% | |
Metal | 1 | 0.1% | |
Roll | 1 | 0.1% |
RoofStyle
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Gable | 1141 | 78.2% | |
Hip | 286 | 19.6% | |
Flat | 13 | 0.9% | |
Gambrel | 11 | 0.8% | |
Mansard | 7 | 0.5% | |
Shed | 2 | 0.1% |
SaleCondition
Categorical
Distinct count | 6 |
---|---|
Unique (%) | 0.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Normal | 1198 | 82.1% | |
Partial | 125 | 8.6% | |
Abnorml | 101 | 6.9% | |
Family | 20 | 1.4% | |
Alloca | 12 | 0.8% | |
AdjLand | 4 | 0.3% |
SalePrice
Numeric
Distinct count | 663 |
---|---|
Unique (%) | 45.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 180920 |
---|---|
Minimum | 34900 |
Maximum | 755000 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 34900 |
---|---|
5-th percentile | 88000 |
Q1 | 129980 |
Median | 163000 |
Q3 | 214000 |
95-th percentile | 326100 |
Maximum | 755000 |
Range | 720100 |
Interquartile range | 84025 |
Descriptive statistics
Standard deviation | 79443 |
---|---|
Coef of variation | 0.4391 |
Kurtosis | 6.5098 |
Mean | 180920 |
MAD | 57435 |
Skewness | 1.8809 |
Sum | 264140000 |
Variance | 6311100000 |
Memory size | 0.0 B |
SaleType
Categorical
Distinct count | 9 |
---|---|
Unique (%) | 0.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
WD | |
---|---|
New | 122 |
COD | 43 |
Other values (6) | 28 |
Value | Count | Frequency (%) | |
WD | 1267 | 86.8% | |
New | 122 | 8.4% | |
COD | 43 | 2.9% | |
ConLD | 9 | 0.6% | |
ConLI | 5 | 0.3% | |
ConLw | 5 | 0.3% | |
CWD | 4 | 0.3% | |
Oth | 3 | 0.2% | |
Con | 2 | 0.1% |
ScreenPorch
Numeric
Distinct count | 76 |
---|---|
Unique (%) | 5.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 15.061 |
---|---|
Minimum | 0 |
Maximum | 480 |
Zeros (%) | 92.1% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 0 |
95-th percentile | 160 |
Maximum | 480 |
Range | 480 |
Interquartile range | 0 |
Descriptive statistics
Standard deviation | 55.757 |
---|---|
Coef of variation | 3.7021 |
Kurtosis | 18.372 |
Mean | 15.061 |
MAD | 27.729 |
Skewness | 4.118 |
Sum | 21989 |
Variance | 3108.9 |
Memory size | 0.0 B |
Street
Categorical
Distinct count | 2 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Pave | |
---|---|
Grvl | 6 |
Value | Count | Frequency (%) | |
Pave | 1454 | 99.6% | |
Grvl | 6 | 0.4% |
TotRmsAbvGrd
Numeric
Distinct count | 12 |
---|---|
Unique (%) | 0.8% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 6.5178 |
---|---|
Minimum | 2 |
Maximum | 14 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 2 |
---|---|
5-th percentile | 4 |
Q1 | 5 |
Median | 6 |
Q3 | 7 |
95-th percentile | 10 |
Maximum | 14 |
Range | 12 |
Interquartile range | 2 |
Descriptive statistics
Standard deviation | 1.6254 |
---|---|
Coef of variation | 0.24938 |
Kurtosis | 0.87364 |
Mean | 6.5178 |
MAD | 1.2796 |
Skewness | 0.67565 |
Sum | 9516 |
Variance | 2.6419 |
Memory size | 0.0 B |
TotalBsmtSF
Numeric
Distinct count | 721 |
---|---|
Unique (%) | 49.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1057.4 |
---|---|
Minimum | 0 |
Maximum | 6110 |
Zeros (%) | 2.5% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 519.3 |
Q1 | 795.75 |
Median | 991.5 |
Q3 | 1298.2 |
95-th percentile | 1753 |
Maximum | 6110 |
Range | 6110 |
Interquartile range | 502.5 |
Descriptive statistics
Standard deviation | 438.71 |
---|---|
Coef of variation | 0.41488 |
Kurtosis | 13.201 |
Mean | 1057.4 |
MAD | 321.28 |
Skewness | 1.5227 |
Sum | 1543800 |
Variance | 192460 |
Memory size | 0.0 B |
Utilities
Categorical
Distinct count | 2 |
---|---|
Unique (%) | 0.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
AllPub | |
---|---|
NoSeWa | 1 |
Value | Count | Frequency (%) | |
AllPub | 1459 | 99.9% | |
NoSeWa | 1 | 0.1% |
WoodDeckSF
Numeric
Distinct count | 274 |
---|---|
Unique (%) | 18.8% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 94.245 |
---|---|
Minimum | 0 |
Maximum | 857 |
Zeros (%) | 52.1% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 168 |
95-th percentile | 335 |
Maximum | 857 |
Range | 857 |
Interquartile range | 168 |
Descriptive statistics
Standard deviation | 125.34 |
---|---|
Coef of variation | 1.3299 |
Kurtosis | 2.9786 |
Mean | 94.245 |
MAD | 102 |
Skewness | 1.5398 |
Sum | 137600 |
Variance | 15710 |
Memory size | 0.0 B |
YearBuilt
Numeric
Distinct count | 112 |
---|---|
Unique (%) | 7.7% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1971.3 |
---|---|
Minimum | 1872 |
Maximum | 2010 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1872 |
---|---|
5-th percentile | 1916 |
Q1 | 1954 |
Median | 1973 |
Q3 | 2000 |
95-th percentile | 2007 |
Maximum | 2010 |
Range | 138 |
Interquartile range | 46 |
Descriptive statistics
Standard deviation | 30.203 |
---|---|
Coef of variation | 0.015322 |
Kurtosis | -0.44215 |
Mean | 1971.3 |
MAD | 25.067 |
Skewness | -0.61283 |
Sum | 2878100 |
Variance | 912.22 |
Memory size | 0.0 B |
YearRemodAdd
Numeric
Distinct count | 61 |
---|---|
Unique (%) | 4.2% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1984.9 |
---|---|
Minimum | 1950 |
Maximum | 2010 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1950 |
---|---|
5-th percentile | 1950 |
Q1 | 1967 |
Median | 1994 |
Q3 | 2004 |
95-th percentile | 2007 |
Maximum | 2010 |
Range | 60 |
Interquartile range | 37 |
Descriptive statistics
Standard deviation | 20.645 |
---|---|
Coef of variation | 0.010401 |
Kurtosis | -1.272 |
Mean | 1984.9 |
MAD | 18.623 |
Skewness | -0.50304 |
Sum | 2897900 |
Variance | 426.23 |
Memory size | 0.0 B |
YrSold
Numeric
Distinct count | 5 |
---|---|
Unique (%) | 0.3% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 2007.8 |
---|---|
Minimum | 2006 |
Maximum | 2010 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 2006 |
---|---|
5-th percentile | 2006 |
Q1 | 2007 |
Median | 2008 |
Q3 | 2009 |
95-th percentile | 2010 |
Maximum | 2010 |
Range | 4 |
Interquartile range | 2 |
Descriptive statistics
Standard deviation | 1.3281 |
---|---|
Coef of variation | 0.00066146 |
Kurtosis | -1.1906 |
Mean | 2007.8 |
MAD | 1.1487 |
Skewness | 0.09617 |
Sum | 2931400 |
Variance | 1.7638 |
Memory size | 0.0 B |
This class builds a profile for a given DataFrame and its general features. It is based on spark-df-profiling by Julio Soto.
The overview presents basic information about the DataFrame: how many variables it has, how many values are missing and in which columns, the type of each variable, and descriptive statistics for each variable along with a frequency plot. It also includes a table listing the datatypes present in each column of the DataFrame, among other features.
You can also use Spark's native describe function to get something very similar to what you got using Pandas.
df.describe().toPandas()
summary | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | count | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | ... | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 |
1 | mean | 730.5 | 56.897260273972606 | None | 70.04995836802665 | 10516.828082191782 | None | None | None | None | ... | 2.758904109589041 | None | None | None | 43.489041095890414 | 6.321917808219178 | 2007.8157534246575 | None | None | 180921.19589041095 |
2 | stddev | 421.6100093688479 | 42.30057099381045 | None | 24.28475177448321 | 9981.26493237915 | None | None | None | None | ... | 40.17730694453021 | None | None | None | 496.1230244579441 | 2.7036262083595113 | 1.3280951205521145 | None | None | 79442.50288288663 |
3 | min | 1 | 20 | C (all) | 100 | 1300 | Grvl | Grvl | IR1 | Bnk | ... | 0 | Ex | GdPrv | Gar2 | 0 | 1 | 2006 | COD | Abnorml | 34900 |
4 | max | 1460 | 190 | RM | NA | 215245 | Pave | Pave | Reg | Lvl | ... | 738 | NA | NA | TenC | 15500 | 12 | 2010 | WD | Partial | 755000 |
5 rows × 82 columns
Your dataset had too many variables to wrap your head around, or even to print out nicely. This is just one of many situations where you'll want to access a smaller set of your data.
For now, we'll rely on our intuition to pick variables to focus on. Later tutorials will show you statistical techniques to automatically prioritize variables.
Before we can choose columns, it is helpful to see a list of all columns in the dataset. That is done with the columns property of the DataFrame (the bottom line of code below).
file_path = 'train.csv'
df = pd.read_csv(file_path)
print(df.columns)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype='object')
There are many ways to select a subset of your data. We'll start with two main approaches: selecting a single column with dot notation, which gives you a Series, and selecting multiple columns by passing a list of column names, which gives you a DataFrame.
df.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'], dtype='object')
house_price = df.SalePrice
house_price.head()
0 208500 1 181500 2 223500 3 140000 4 250000 Name: SalePrice, dtype: int64
columns_of_interest = ['1stFlrSF', '2ndFlrSF']
two_columns_of_data = df[columns_of_interest]
two_columns_of_data
1stFlrSF | 2ndFlrSF | |
---|---|---|
0 | 856 | 854 |
1 | 1262 | 0 |
2 | 920 | 866 |
3 | 961 | 756 |
4 | 1145 | 1053 |
5 | 796 | 566 |
6 | 1694 | 0 |
7 | 1107 | 983 |
8 | 1022 | 752 |
9 | 1077 | 0 |
10 | 1040 | 0 |
11 | 1182 | 1142 |
12 | 912 | 0 |
13 | 1494 | 0 |
14 | 1253 | 0 |
15 | 854 | 0 |
16 | 1004 | 0 |
17 | 1296 | 0 |
18 | 1114 | 0 |
19 | 1339 | 0 |
20 | 1158 | 1218 |
21 | 1108 | 0 |
22 | 1795 | 0 |
23 | 1060 | 0 |
24 | 1060 | 0 |
25 | 1600 | 0 |
26 | 900 | 0 |
27 | 1704 | 0 |
28 | 1600 | 0 |
29 | 520 | 0 |
... | ... | ... |
1430 | 734 | 1104 |
1431 | 958 | 0 |
1432 | 968 | 0 |
1433 | 962 | 830 |
1434 | 1126 | 0 |
1435 | 1537 | 0 |
1436 | 864 | 0 |
1437 | 1932 | 0 |
1438 | 1236 | 0 |
1439 | 1040 | 685 |
1440 | 1423 | 748 |
1441 | 848 | 0 |
1442 | 1026 | 981 |
1443 | 952 | 0 |
1444 | 1422 | 0 |
1445 | 913 | 0 |
1446 | 1188 | 0 |
1447 | 1220 | 870 |
1448 | 796 | 550 |
1449 | 630 | 0 |
1450 | 896 | 896 |
1451 | 1578 | 0 |
1452 | 1072 | 0 |
1453 | 1140 | 0 |
1454 | 1221 | 0 |
1455 | 953 | 694 |
1456 | 2073 | 0 |
1457 | 1188 | 1152 |
1458 | 1078 | 0 |
1459 | 1256 | 0 |
1460 rows × 2 columns
two_columns_of_data.describe()
1stFlrSF | 2ndFlrSF | |
---|---|---|
count | 1460.000000 | 1460.000000 |
mean | 1162.626712 | 346.992466 |
std | 386.587738 | 436.528436 |
min | 334.000000 | 0.000000 |
25% | 882.000000 | 0.000000 |
50% | 1087.000000 | 0.000000 |
75% | 1391.250000 | 728.000000 |
max | 4692.000000 | 2065.000000 |
import optimus as op
tools = op.Utilities()
df = tools.read_csv("train.csv")
df.columns
['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice']
Column selection is not the same in Spark; it's more SQL-like. Dot notation exists, but it will give you a Column object, which is not that useful for now, so let's select it instead. This will create another Spark DataFrame.
house_price = df.select("SalePrice")
Again, the behavior is not the same; this is what happens if you use head:
house_price.head()
Row(SalePrice=208500)
What? I know... But you can use the show() method to see its content; calling it with n=5 gives the same result as head in Pandas. If you want to know more about Rows in Spark, go here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row
house_price.show(n=5)
+---------+ |SalePrice| +---------+ | 208500| | 181500| | 223500| | 140000| | 250000| +---------+ only showing top 5 rows
Woo!
We'll use select() again.
two_columns_of_data = df.select("1stFlrSF","2ndFlrSF")
two_columns_of_data.show(n=2)
+--------+--------+ |1stFlrSF|2ndFlrSF| +--------+--------+ | 856| 854| | 1262| 0| +--------+--------+ only showing top 2 rows
or
columns_of_interest = ['1stFlrSF', '2ndFlrSF']
two_columns_of_data = df.select(columns_of_interest)
two_columns_of_data.show(n=2)
+--------+--------+ |1stFlrSF|2ndFlrSF| +--------+--------+ | 856| 854| | 1262| 0| +--------+--------+ only showing top 2 rows
two_columns_of_data.describe().show()
+-------+-----------------+------------------+ |summary| 1stFlrSF| 2ndFlrSF| +-------+-----------------+------------------+ | count| 1460| 1460| | mean|1162.626712328767|346.99246575342465| | stddev|386.5877380410744| 436.528435886257| | min| 334| 0| | max| 4692| 2065| +-------+-----------------+------------------+
And with the Optimus profiler (remember to click on Toggle Details):
profiler = op.DataFrameProfiler(two_columns_of_data)
profiler.profiler()
Dataset info
Number of variables | 2 |
---|---|
Number of observations | 1460 |
Total Missing (%) | 0.0% |
Total size in memory | 0.0 B |
Average record size in memory | 0.0 B |
Variables types
Numeric | 2 |
---|---|
Categorical | 0 |
Date | 0 |
Text (Unique) | 0 |
Rejected | 0 |
Warnings
2ndFlrSF has 829 / 56.8% zeros
1stFlrSF
Numeric
Distinct count | 753 |
---|---|
Unique (%) | 51.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 1162.6 |
---|---|
Minimum | 334 |
Maximum | 4692 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 334 |
---|---|
5-th percentile | 672.95 |
Q1 | 882 |
Median | 1087 |
Q3 | 1391.2 |
95-th percentile | 1831.2 |
Maximum | 4692 |
Range | 4358 |
Interquartile range | 509.25 |
Descriptive statistics
Standard deviation | 386.59 |
---|---|
Coef of variation | 0.33251 |
Kurtosis | 5.7221 |
Mean | 1162.6 |
MAD | 300.58 |
Skewness | 1.3753 |
Sum | 1697400 |
Variance | 149450 |
Memory size | 0.0 B |
2ndFlrSF
Numeric
Distinct count | 417 |
---|---|
Unique (%) | 28.6% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 346.99 |
---|---|
Minimum | 0 |
Maximum | 2065 |
Zeros (%) | 56.8% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 0 |
Q1 | 0 |
Median | 0 |
Q3 | 728 |
95-th percentile | 1141 |
Maximum | 2065 |
Range | 2065 |
Interquartile range | 728 |
Descriptive statistics
Standard deviation | 436.53 |
---|---|
Coef of variation | 1.258 |
Kurtosis | -0.55568 |
Mean | 346.99 |
MAD | 396.48 |
Skewness | 0.81219 |
Sum | 506610 |
Variance | 190560 |
Memory size | 0.0 B |
You have the code to load your data, and you know how to index it. You are ready to choose which column you want to predict. This column is called the prediction target. There is a convention that the prediction target is referred to as y. Here is an example doing that with the example data.
Check the code in the original repo before running this! https://www.kaggle.com/dansbecker/your-first-scikit-learn-model
Now it's time for you to define and fit a model for your data (in your notebook).
Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.
Create a list of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):
Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.
Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.
Fit the model you have created using the data in X and the target data you saved above.
Make a few predictions with the model's predict command and print out the predictions.
The variable we want to predict is the SalePrice. So:
import pandas as pd
file_path = 'train.csv'
df = pd.read_csv(file_path)
y = df.SalePrice
- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd
predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = df[predictors]
You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames. We will use Spark after :).
The steps to building and using a model are: define the type of model, fit it to your data, predict with it, and evaluate how accurate its predictions are.
from sklearn.tree import DecisionTreeRegressor
# Define model
my_model = DecisionTreeRegressor()
# Fit model
my_model.fit(X, y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(my_model.predict(X.head()))
Making predictions for the following 5 houses: LotArea YearBuilt 1stFlrSF 2ndFlrSF FullBath BedroomAbvGr \ 0 8450 2003 856 854 2 3 1 9600 1976 1262 0 2 3 2 11250 2001 920 866 2 3 3 9550 1915 961 756 1 3 4 14260 2000 1145 1053 2 4 TotRmsAbvGrd 0 8 1 6 2 6 3 7 4 9 The predictions are [ 208500. 181500. 223500. 140000. 250000.]
The real prices are
y.head().tolist()
[208500, 181500, 223500, 140000, 250000]
So it's a very good model :). Or is it?
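The "perfect" predictions above are in-sample: an unconstrained decision tree can simply memorize its training data. A minimal sketch of this effect, using scikit-learn with made-up numbers (the values below are illustrative, not from the dataset):

```python
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up dataset: 4 houses described by a single feature each
X = [[1000], [1500], [2000], [2500]]
y = [100000.0, 150000.0, 200000.0, 250000.0]

# With no depth or leaf limits, the tree grows until every training point
# gets its own leaf, so predictions on the training data match y exactly
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)
print(list(model.predict(X)))  # reproduces y exactly
```

This is exactly why the predictions on the first five houses looked suspiciously perfect.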
You've built a decision tree model that can predict the prices of houses based on their characteristics. It's natural to ask how accurate the model's predictions will be, and measuring accuracy is necessary for us to see whether or not other approaches improve our model.
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as ML algorithms (classification, regression, clustering, and collaborative filtering), featurization (feature extraction, transformation, dimensionality reduction, and selection), pipelines for constructing, evaluating, and tuning ML workflows, persistence for saving and loading algorithms, models, and pipelines, and utilities for linear algebra, statistics, and data handling.
“Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API. This is majorly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and the “Spark ML Pipelines” term we used initially to emphasize the pipeline concept.
The variable we want to predict is the SalePrice. So:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_path = 'train.csv'
df = spark.read.csv(file_path, header="true", inferSchema=True)
from pyspark.ml.feature import VectorAssembler, VectorIndexer
# Choosing predictors
features_cols = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
# This concatenates all feature columns into a single feature vector in a new column "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=features_cols, outputCol="raw_features")
# This identifies categorical features and indexes them.
vectorIndexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=4)
from pyspark.ml.regression import DecisionTreeRegressor
# Takes the "features" column and learns to predict "SalePrice"
dt = DecisionTreeRegressor(labelCol="SalePrice")
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
vectorAssembler,
vectorIndexer,
dt
])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("SalePrice","Prediction").show(5)
+---------+------------------+ |SalePrice| Prediction| +---------+------------------+ | 208500|219910.90697674418| | 181500|150758.37549407114| | 223500|219910.90697674418| | 140000|149397.38805970148| | 250000|300287.74358974356| +---------+------------------+ only showing top 5 rows
So it seems it's not that easy with Spark, and not even that accurate, but remember this will scale, and as you will see, Spark needs more tweaks to improve performance. But more on that later.
You've built a model. But how good is it?
You'll need to answer this question for almost every model you ever build. In most (though not necessarily all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?
Some people try answering this problem by making predictions with their training data. They compare those predictions to the actual target values in the training data. This approach has a critical shortcoming, which you will see in a moment (and which you'll subsequently see how to solve).
Even with this simple approach, you'll need to summarize the model quality into a form that someone can understand. If you have predicted and actual home values for 10000 houses, you will inevitably end up with a mix of good and bad predictions. Looking through such a long list would be pointless.
There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.
The prediction error for each house is: error=actual−predicted
So, if a house cost $150,000 and you predicted it would cost $100,000 the error is $50,000.
With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as
On average, our predictions are off by about X
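To make the definition concrete, here is a small pure-Python sketch of MAE with made-up prices (the numbers are illustrative, not from the dataset):

```python
# Made-up actual and predicted sale prices
actual    = [150000, 200000, 180000, 120000]
predicted = [100000, 210000, 170000, 150000]

# error = actual - predicted; MAE is the average of the absolute errors
errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)
print(mae)  # → 25000.0
```

So for these four houses, predictions are off by about $25,000 on average.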
In the notebook for this module they first load the Melbourne data and create X and y. We'll solve the problem first with sklearn and then with Spark :).
The measure they computed in the original notebook can be called an "in-sample" score. They used a single set of houses (called a data sample) for both building the model and for calculating its MAE score. This is bad.
Imagine that, in the large real estate market, door color is unrelated to home price. However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.
Since this pattern was originally derived from the training data, the model will appear accurate in the training data.
But this pattern likely won't hold when the model sees new data, and the model would be very inaccurate (and cost us lots of money) when we applied it to our real estate business.
Even a model capturing only happenstance relationships in the data, relationships that will not be repeated when the model sees new data, can appear to be very accurate on in-sample accuracy measurements.
Models' practical value comes from making predictions on new data, so we should measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on examples it hasn't seen before. This data is called validation data.
The scikit-learn library has a function train_test_split to break up the data into two pieces. We'll see afterwards how to do this in Spark.
import pandas as pd
file_path = 'train.csv'
df = pd.read_csv(file_path)
y = df.SalePrice
predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = df[predictors]
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)
It should be noted that by default, the size of the test set is 0.25 of the data, and of course the training set gets the remaining 0.75.
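Conceptually, the split amounts to shuffling row indices with a seeded random number generator and slicing. A rough pure-Python sketch of the idea (not scikit-learn's actual implementation):

```python
import random

def simple_split(rows, test_size=0.25, seed=0):
    # Shuffle a copy of the rows deterministically, then slice off the test set
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train_rows, test_rows = simple_split(list(range(100)))
print(len(train_rows), len(test_rows))  # → 75 25
```

Because the seed is fixed, calling the function again with the same seed returns the same split, which is exactly what passing random_state to train_test_split buys you.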
from sklearn.tree import DecisionTreeRegressor
# Define model
model = DecisionTreeRegressor()
# Fit model
model.fit(train_X, train_y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
# get predicted prices on validation data
val_predictions = model.predict(val_X)
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, val_predictions))
33606.2356164
The process in spark will be really similar, we first need to build the model as we saw before, but let's start with the data splitting.
In Spark, randomSplit randomly splits a DataFrame with the provided weights. The weights are a list of doubles with which to split the DataFrame; they will be normalized if they don't sum to 1.0. It's important to set a seed for reproducibility.
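The normalization mentioned above is simple: each weight is divided by the total. A quick sketch in plain Python:

```python
def normalize(weights):
    # Divide each weight by the sum so the result adds up to 1.0
    total = sum(weights)
    return [w / total for w in weights]

print(normalize([3, 1]))        # → [0.75, 0.25]
print(normalize([0.75, 0.25]))  # unchanged: already sums to 1.0
```

So passing weights=[3, 1] to randomSplit would behave the same as weights=[0.75, 0.25].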
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_path = 'train.csv'
df = spark.read.csv(file_path, header="true", inferSchema=True)
train, test = df.randomSplit(weights=[0.75,0.25],seed=27)
from pyspark.ml.feature import VectorAssembler, VectorIndexer
# Choosing predictors
features_cols = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
# This concatenates all feature columns into a single feature vector in a new column "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=features_cols, outputCol="raw_features")
# This identifies categorical features and indexes them.
vectorIndexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=4)
from pyspark.ml.regression import DecisionTreeRegressor
# Takes the "features" column and learns to predict "SalePrice"
dt = DecisionTreeRegressor(labelCol="SalePrice")
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
vectorAssembler,
vectorIndexer,
dt
])
# Creating a model with the training data
model = pipeline.fit(train)
# Making predictions with the test data
predictions = model.transform(test)
Let's see some of the predictions just to be sure we did a good job above
predictions.select("SalePrice","prediction").show(2)
+---------+----------+ |SalePrice|prediction| +---------+----------+ | 181500| 167027.0| | 250000| 290103.0| +---------+----------+ only showing top 2 rows
Now for the evaluation
from pyspark.ml.evaluation import RegressionEvaluator
# Select (prediction, true label) and compute test error with MAE
evaluator = RegressionEvaluator(
labelCol="SalePrice", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("Mean Absolute Error (MAE) on test data= {}".format(mae))
Mean Absolute Error (MAE) on test data= 29025.42375836274
So now we see that using this method we got a better result than with sklearn. So this is a good start for Spark :).
Now that you have a trustworthy way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?
You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from page 2 that a tree's depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree.
In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves.
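The doubling described above is just powers of two; a quick check confirms the leaf count at the 10th level:

```python
# Number of groups after k levels of binary splits: 2**k
groups = [2 ** k for k in range(1, 11)]
print(groups)      # 2, 4, 8, ... doubling at each level
print(groups[-1])  # → 1024
```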
When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.
At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.
Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting.
There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting side toward the overfitting side.
We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
model.fit(predictors_train, targ_train)
preds_val = model.predict(predictors_val)
mae = mean_absolute_error(targ_val, preds_val)
return(mae)
Let's load the data again and split it:
import pandas as pd
file_path = 'train.csv'
df = pd.read_csv(file_path)
y = df.SalePrice
predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = df[predictors]
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
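By default, train_test_split holds out 25% of the rows for validation. This is a minimal standard-library sketch of the idea (it illustrates the concept, not how scikit-learn actually implements it):

```python
import random

rows = list(range(100))      # stand-in for the dataset's row indices
rng = random.Random(0)       # fixed seed, analogous to random_state=0
rng.shuffle(rows)

cut = int(len(rows) * 0.75)  # 75% train / 25% validation, the sklearn default
train_rows, val_rows = rows[:cut], rows[cut:]
print(len(train_rows), len(val_rows))  # → 75 25
```

Fixing the seed is what makes the split reproducible: shuffling with the same seed always produces the same ordering, so the same rows land in training and validation on every run.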
We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
Max leaf nodes: 5        Mean Absolute Error: 35190
Max leaf nodes: 50       Mean Absolute Error: 27825
Max leaf nodes: 500      Mean Absolute Error: 32662
Max leaf nodes: 5000     Mean Absolute Error: 33382
We can see that in this case 50 is the optimal number of leaves (it has the lowest MAE).
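Rather than reading the winner off the printout, the loop could store each score and let min() pick the best size. A minimal sketch, using the MAE values printed above as stand-ins for calls to get_mae:

```python
# Map each candidate tree size to its validation MAE (values from the run above).
# In practice you would build this with:
#   scores = {size: get_mae(size, train_X, val_X, train_y, val_y) for size in candidates}
scores = {5: 35190, 50: 27825, 500: 32662, 5000: 33382}

# min() over the keys, ranked by their MAE, returns the best tree size.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # → 50
```

Having the winner as a variable is handy for the usual final step: refitting a model with max_leaf_nodes=best_tree_size on all of the data before making real predictions.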
Here's the takeaway: models can suffer from either overfitting (capturing spurious patterns that won't recur in new data, leading to less accurate predictions) or underfitting (failing to capture relevant patterns, again leading to less accurate predictions).
We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.
But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards.
We will tune the maxDepth parameter instead, because Spark's DecisionTreeRegressor has no max_leaf_nodes argument. The Spark documentation describes maxDepth as:
Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
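So maxDepth bounds the leaf count indirectly: a full binary tree of depth d has at most 2**d leaves and 2**(d+1) - 1 nodes in total. A small sketch of that relationship, matching the counts in the docstring above:

```python
def max_leaves(depth):
    """Maximum leaves in a full binary tree of the given depth."""
    return 2 ** depth

def max_nodes(depth):
    """Maximum total nodes (internal + leaves) in a full binary tree."""
    return 2 ** (depth + 1) - 1

print(max_leaves(0), max_nodes(0))  # → 1 1   (depth 0: a single leaf)
print(max_leaves(1), max_nodes(1))  # → 2 3   (depth 1: 1 internal node + 2 leaves)
print(max_leaves(5), max_nodes(5))  # → 32 63
```

This is why small steps in maxDepth change model capacity much faster than small steps in max_leaf_nodes: each extra level of depth can double the number of leaves.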
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_path = 'train.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)
train, test = df.randomSplit(weights=[0.75, 0.25], seed=27)
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler, VectorIndexer

# same predictor columns as in the scikit-learn example
features_cols = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

def get_mae_spark(max_depth, train_data, test_data):
    # Concatenate all feature columns into a single vector column "raw_features"
    vectorAssembler = VectorAssembler(inputCols=features_cols, outputCol="raw_features")
    # Identify categorical features (<= 4 distinct values) and index them
    vectorIndexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=4)
    # Learn to predict "SalePrice" from the "features" column
    dt = DecisionTreeRegressor(labelCol="SalePrice", maxDepth=max_depth)
    pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, dt])
    # Fit the pipeline on the training data
    model = pipeline.fit(train_data)
    # Make predictions on the test data
    predictions = model.transform(test_data)
    # Compare predictions to the true labels with MAE
    evaluator = RegressionEvaluator(labelCol="SalePrice", predictionCol="prediction", metricName="mae")
    mae = evaluator.evaluate(predictions)
    return mae
# compare MAE with differing values of maxDepth
for max_depth in [5, 10, 15, 20]:
    my_mae = get_mae_spark(max_depth, train, test)
    print("Max depth: {} \t\t Mean Absolute Error: {}".format(max_depth, my_mae))
Max depth: 5     Mean Absolute Error: 29025.42375836274
Max depth: 10    Mean Absolute Error: 28375.668652125525
Max depth: 15    Mean Absolute Error: 30544.50034435261
Max depth: 20    Mean Absolute Error: 30883.51698806244
Here the best maxDepth is 10 (it has the lowest MAE) :)