🎯 In this challenge, you will predict the sale price of houses (`SalePrice`) based on the surface, the number of bedrooms and the overall quality.
Run the cell below to import some Python libraries - these will be our tools for working with data 📊
import pandas as pd
import numpy as np
import seaborn as sns
👇 Run the cell below to load the `house_prices.csv` dataset into this notebook as a pandas DataFrame, and display its first 5 rows.
Note: the dataset has been cleaned and curated for learning purposes
houses = pd.read_csv('https://storage.googleapis.com/introduction-to-data-science/house-prices.csv')
houses.head()
This dataset contains information about houses sold.
The columns in the given dataset are as follows:

Features:
- `GrLivArea`: surface in square feet
- `BedroomAbvGr`: number of bedrooms
- `KitchenAbvGr`: number of kitchens
- `OverallQual`: overall quality (1: Very Poor / 10: Very Excellent)

Target:
- `SalePrice`: sale price in USD

Let's start by understanding the data we have - how big is the dataset, what information (columns) do we have, and so on:
💡 Tip: remember to check the slides for the right methods ;)
# your code here
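If you get stuck, here is one possible sketch. It uses a tiny made-up DataFrame with the same columns (the values are invented for illustration; in the notebook you would call these methods on the `houses` DataFrame loaded from the CSV):

```python
import pandas as pd

# Made-up mini-dataset with the same columns as the real one (values are invented)
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800],
    "BedroomAbvGr": [3, 4, 2, 3],
    "KitchenAbvGr": [1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6],
    "SalePrice":    [208500, 305000, 140000, 235000],
})

print(houses.shape)    # (number of rows, number of columns)
print(houses.columns)  # the column names
houses.info()          # dtypes and non-null counts per column
houses.describe()      # summary statistics for each numeric column
```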
Now try to select only some columns - say we only want to see `SalePrice`, or `GrLivArea` and `BedroomAbvGr`:
# your code here
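One possible approach, shown on a made-up mini-DataFrame (in the notebook you would use the real `houses` DataFrame): a single column name gives a Series, a list of names gives a DataFrame.

```python
import pandas as pd

# Made-up mini-dataset with the same columns as the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800],
    "BedroomAbvGr": [3, 4, 2, 3],
    "KitchenAbvGr": [1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6],
    "SalePrice":    [208500, 305000, 140000, 235000],
})

sale_prices = houses["SalePrice"]               # single column -> pandas Series
subset = houses[["GrLivArea", "BedroomAbvGr"]]  # list of columns -> DataFrame
print(subset.head())
```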
Let's follow some basic intuition - does the surface (`GrLivArea`) affect the price of the house (`SalePrice`)❓
Let's use a Seaborn scatterplot - a method in the Seaborn library (which we imported above and shortened to `sns`) that gives us a graph with data points plotted as dots at their `x` and `y` values.
# your code here
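As a sketch, the call looks like the following (shown on invented data; in the notebook, `data=houses` would be the real DataFrame):

```python
import pandas as pd
import seaborn as sns

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

# Each dot is one house: x = surface, y = sale price
ax = sns.scatterplot(data=houses, x="GrLivArea", y="SalePrice")
```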
Does the overall quality (`OverallQual`) have an impact on `SalePrice`❓
💡 Tip: You can add a `hue` to the previous graph
# your code here
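A possible sketch, again on invented data: the `hue` argument colours each dot by a third column, so you can see quality and price at the same time.

```python
import pandas as pd
import seaborn as sns

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

# Same scatterplot, with dots coloured by overall quality
ax = sns.scatterplot(data=houses, x="GrLivArea", y="SalePrice", hue="OverallQual")
```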
Let's also understand the distribution we have for some features: Seaborn's `countplot` is here to help with that.
# your code here
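One possible sketch (invented data): `countplot` draws one bar per distinct value, with the bar height showing how many rows have that value.

```python
import pandas as pd
import seaborn as sns

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

# One bar per distinct number of bedrooms
ax = sns.countplot(data=houses, x="BedroomAbvGr")
```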
1. First, let's create what will be our features and our target.
Create a variable `features` containing all the features:
# your code here
Create a variable `target` containing the target:
# your code here
Feel free to check what is in your `features` and `target` below:
# your code here
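One possible sketch of steps 1 and the check, on invented data: `features` is a DataFrame of the four input columns, `target` is the `SalePrice` Series.

```python
import pandas as pd

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
print(features.shape, target.shape)  # features: 2-D table, target: 1-D column
```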
2. Time to import the sklearn function to split our dataset into a train and a test set
Try to find the right function here
# your code here
3. Use this function to create `X_train`, `X_test`, `y_train`, `y_test`.
🚨 Set `random_state=42` as an argument of the function.
# your code here
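A minimal sketch of steps 2 and 3, assuming `features` and `target` were built as above (the data here is invented; by default `train_test_split` sends 25% of the rows to the test set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]

# random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)
print(X_train.shape, X_test.shape)
```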
Let's check what is in your `X_train`, `X_test`, `y_train`, `y_test`:

What are `X_train` and `X_test`?

# your code here
4. Time to import the Linear Regression model
Python libraries like Scikit-learn make it super easy for people getting into Data Science and ML to experiment.
The code is already in the library, it's just about calling the right methods! 🛠
# your code here
Now let's initialize the model and store it in a variable `model`:
# your code here
5. Train the model on the training set.
This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work learning! 🤖
# your code here
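A possible sketch of steps 4 and 5 together, on invented data: import `LinearRegression`, initialize it, then `fit` it on the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()   # initialize the model
model.fit(X_train, y_train)  # learn the best-fitting line from the training data
```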
6. Evaluate the performance of the model on the test set.
Models can have different default scoring metrics. Linear Regression by default uses R-squared - a metric that shows how much of the variation in the target (`SalePrice`) can be explained by the variation in the features (`GrLivArea`, `BedroomAbvGr`, `KitchenAbvGr` and `OverallQual`).
# your code here
⚠️ Careful not to confuse this with accuracy. The number above shows that the features we have can explain this percentage of the variation in `SalePrice` - which is decent considering we did it with just a few lines of code!
Let's compare this score to the one the model gets on the training set:
# your code here
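One way to compute both scores, sketched end-to-end on invented data (on the real dataset the numbers will of course differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

test_score = model.score(X_test, y_test)    # R² on data the model never saw
train_score = model.score(X_train, y_train) # R² on the data it was trained on
print(test_score, train_score)
```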
👉 You should get a slightly higher score on the training set, which is to be expected in general.
The good news is that the 2 scores are relatively close to each other, which shows that we achieved a good balance: our model generalises well to new observations, explaining more than 70% of the variation in `SalePrice`.
Splitting the dataset into a training set and a test set is essential in Machine Learning. It allows us to identify overfitting: a model that performs well on the data it was trained on but poorly on data it has never seen.
In our case, we have a robust model that does well on new observations💪. We can now use it to make predictions on new houses with confidence.
7. Let's predict the price of a new house 🔮
This new house has the following characteristics:
7.1 Start by creating a variable `new_house` in which you will store those characteristics. Make sure to use the right format to be able to make a prediction.
Note: here is a reminder of the columns in the table: ['GrLivArea', 'BedroomAbvGr', 'KitchenAbvGr', 'OverallQual']
`new_house` should be a `list of list`: [[surface, nb_bedrooms, nb_kitchens, overall_quality]]
# your code here
7.2 Now use the right method to make a prediction using the model we just trained:
# your code here
Now let's say we have another house with the same characteristics, except for the overall quality score being 9.
What would be the price of this house❓
# your code here
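A sketch of 7.1 and 7.2 on invented data - the characteristics here (`[[2000, 3, 1, 5]]`) are made up as an example, use the ones given in the exercise. `predict` takes a list of lists and returns one prediction per inner list:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# [[surface, nb_bedrooms, nb_kitchens, overall_quality]] - example values
new_house = [[2000, 3, 1, 5]]
predicted_price = model.predict(new_house)

# Same house, but with an overall quality of 9
better_house = [[2000, 3, 1, 9]]
predicted_price_better = model.predict(better_house)
print(predicted_price, predicted_price_better)
```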
8. Explaining the model
Linear Regression is a linear model, so its explainability is quite high.
8.1. We can check `coef_`, the coefficients of the model. These explain how much the target (`SalePrice`) changes with a change of `1` in each of the features (inputs), while holding the other features constant.
# your code here
🤔 We'd need to check the column order again, to know which number is which input. But, we got you covered! Run the cell below:
pd.concat([pd.DataFrame(features.columns),pd.DataFrame(np.transpose(model.coef_))], axis = 1)
8.2 The other thing we can check is the intercept of the model. This is the predicted target (`SalePrice`) when all inputs are 0 - i.e. a house with a surface of 0 square feet, no bedrooms, no kitchens and an overall quality of 0:
# your code here
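A sketch of 8.1 and 8.2 together, again on invented data: `coef_` holds one coefficient per feature (in column order), `intercept_` is the baseline prediction when all features are 0.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_)       # one coefficient per feature, in column order
print(model.intercept_)  # predicted SalePrice when all features are 0
```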