🎯 In this challenge, you will predict the sale price of houses (`SalePrice`) based on the surface, the number of bedrooms and the overall quality.
Run the cell below to import some Python libraries - these will be our tools for working with data 📊
import pandas as pd
import numpy as np
import seaborn as sns
👇 Run the cell below to load the `house_prices.csv` dataset into this notebook as a pandas DataFrame, and display its first 5 rows.
Note: the dataset has been cleaned and curated for learning purposes
houses = pd.read_csv('https://storage.googleapis.com/introduction-to-data-science/house-prices.csv')
houses.head()
This dataset contains information about houses sold.
The columns in the given dataset are as follows:

Features:
- `GrLivArea`: surface in square feet
- `BedroomAbvGr`: number of bedrooms
- `KitchenAbvGr`: number of kitchens
- `OverallQual`: overall quality (1: Very Poor / 10: Very Excellent)

Target:
- `SalePrice`: sale price in USD

Let's start by understanding the data we have - how big is the dataset, what information (columns) do we have, and so on:
💡 Tip: remember to check the slides for the right methods ;)
# your code here
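If you get stuck, here is one possible sketch. It uses a tiny made-up DataFrame with the same columns (the values are invented for illustration; in the notebook you would call these methods on the `houses` DataFrame loaded from the CSV):

```python
import pandas as pd

# Made-up mini-dataset with the same columns as the real one (values are invented)
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800],
    "BedroomAbvGr": [3, 4, 2, 3],
    "KitchenAbvGr": [1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6],
    "SalePrice":    [208500, 305000, 140000, 235000],
})

print(houses.shape)    # (number of rows, number of columns)
print(houses.columns)  # the column names
houses.info()          # dtypes and non-null counts per column
houses.describe()      # summary statistics for each numeric column
```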
Now try to select only some columns - say we only want to see `SalePrice`, or `GrLivArea` and `BedroomAbvGr`:
# your code here
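One possible approach, shown on a made-up mini-DataFrame (in the notebook you would use the real `houses` DataFrame): a single column name gives a Series, a list of names gives a DataFrame.

```python
import pandas as pd

# Made-up mini-dataset with the same columns as the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800],
    "BedroomAbvGr": [3, 4, 2, 3],
    "KitchenAbvGr": [1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6],
    "SalePrice":    [208500, 305000, 140000, 235000],
})

sale_prices = houses["SalePrice"]               # single column -> pandas Series
subset = houses[["GrLivArea", "BedroomAbvGr"]]  # list of columns -> DataFrame
print(subset.head())
```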
Let's follow some basic intuition - does the surface (`GrLivArea`) affect the price of the house (`SalePrice`)❓
Let's use a Seaborn scatterplot - a method in the Seaborn library (which we imported above and shortened to `sns`) that gives us a graph with data points plotted as dots at their `x` and `y` values.
# your code here
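As a sketch, the call looks like the following (shown on invented data; in the notebook, `data=houses` would be the real DataFrame):

```python
import pandas as pd
import seaborn as sns

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

# Each dot is one house: x = surface, y = sale price
ax = sns.scatterplot(data=houses, x="GrLivArea", y="SalePrice")
```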
Does the overall quality (`OverallQual`) have an impact on `SalePrice`❓
💡 Tip: You can add a `hue` to the previous graph
# your code here
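A possible sketch, again on invented data: the `hue` argument colours each dot by a third column, so you can see quality and price at the same time.

```python
import pandas as pd
import seaborn as sns

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

# Same scatterplot, with dots coloured by overall quality
ax = sns.scatterplot(data=houses, x="GrLivArea", y="SalePrice", hue="OverallQual")
```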
Let's also understand the distribution we have for some features: Seaborn's `countplot` is here to help with that.
# your code here
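One possible sketch (invented data): `countplot` draws one bar per distinct value, with the bar height showing how many rows have that value.

```python
import pandas as pd
import seaborn as sns

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

# One bar per distinct number of bedrooms
ax = sns.countplot(data=houses, x="BedroomAbvGr")
```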
1. First, let's create what will be our features and our target.
Create a variable `features` containing all the features:
# your code here
Create a variable `target` containing the target:
# your code here
Feel free to check what is in your `features` and `target` below:
# your code here
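One possible sketch of steps 1 and the check, on invented data: `features` is a DataFrame of the four input columns, `target` is the `SalePrice` Series.

```python
import pandas as pd

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})

features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
print(features.shape, target.shape)  # features: 2-D table, target: 1-D column
```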
2. Time to import the sklearn function to split our dataset into a train and a test set
Try to find the right function here
# your code here
3. Use this function to create `X_train`, `X_test`, `y_train`, `y_test`.
🚨 Set `random_state=42` as an argument of the function.
# your code here
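A minimal sketch of steps 2 and 3, assuming `features` and `target` were built as above (the data here is invented; by default `train_test_split` sends 25% of the rows to the test set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]

# random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)
print(X_train.shape, X_test.shape)
```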
Let's check what is in your `X_train`, `X_test`, `y_train`, `y_test`:

What are `X_train` and `X_test`?

# your code here
4. Time to import the Linear Regression model
Python libraries like Scikit-learn make it super easy for people getting into Data Science and ML to experiment.
The code is already in the library, it's just about calling the right methods! 🛠
# your code here
Now let's initialize the model and store it in a variable `model`:
# your code here
5. Train the model on the training set.
This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work learning! 🤖
# your code here
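A possible sketch of steps 4 and 5 together, on invented data: import `LinearRegression`, initialize it, then `fit` it on the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()   # initialize the model
model.fit(X_train, y_train)  # learn the best-fitting line from the training data
```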
6. Evaluate the performance of the model on the test set.
Models can have different default scoring metrics. Linear Regression by default uses R-squared - a metric that shows how much of the variation in the target (`SalePrice`) can be explained by the variation in the features (`GrLivArea`, `BedroomAbvGr`, `KitchenAbvGr` and `OverallQual`).
# your code here
⚠️ Careful not to confuse this with accuracy. The number above shows that the features we have can explain this percentage of the variation in `SalePrice` - which is decent considering we did it with just a few lines of code!
Let's compare this score to the one the model gets on the training set:
# your code here
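One way to compute both scores, sketched end-to-end on invented data (on the real dataset the numbers will of course differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

test_score = model.score(X_test, y_test)    # R² on data the model never saw
train_score = model.score(X_train, y_train) # R² on the data it was trained on
print(test_score, train_score)
```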
👉 You should get a slightly higher score on the training set, which is to be expected in general.
The good news is that the 2 scores are relatively close to each other, which shows that we achieved a good balance: our model generalises well to new observations, explaining more than 70% of the variation in `SalePrice`.
Splitting the dataset into a training set and a test set is essential in Machine Learning. It allows us to identify overfitting: a model that performs well on the data it was trained on but poorly on data it has never seen.
In our case, we have a robust model that does well on new observations💪. We can now use it to make predictions on new houses with confidence.
7. Let's predict the price of a new house 🔮
This new house has the following characteristics:
7.1 Start by creating a variable `new_house` in which you will store those characteristics. Make sure to use the right format to be able to make a prediction.
Note: here is a reminder of the columns in the table: ['GrLivArea', 'BedroomAbvGr', 'KitchenAbvGr', 'OverallQual']
`new_house` should be a `list of list`: [[surface, nb_bedrooms, nb_kitchens, overall_quality]]
# your code here
7.2 Now use the right method to make a prediction using the model we just trained:
# your code here
Now let's say we have another house with the same characteristics, except for the overall quality score being 9.
What would be the price of this house❓
# your code here
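A sketch of 7.1 and 7.2 on invented data - the characteristics here (`[[2000, 3, 1, 5]]`) are made up as an example, use the ones given in the exercise. `predict` takes a list of lists and returns one prediction per inner list:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# [[surface, nb_bedrooms, nb_kitchens, overall_quality]] - example values
new_house = [[2000, 3, 1, 5]]
predicted_price = model.predict(new_house)

# Same house, but with an overall quality of 9
better_house = [[2000, 3, 1, 9]]
predicted_price_better = model.predict(better_house)
print(predicted_price, predicted_price_better)
```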
8. Explaining the model
Linear Regression is a linear model, so its explainability is quite high.
8.1. We can check `coef_`, the coefficients of the model. These explain how much the target (`SalePrice`) changes with a change of `1` in each of the features (inputs), while holding the other features constant.
# your code here
🤔 We'd need to check the column order again, to know which number is which input. But, we got you covered! Run the cell below:
pd.concat([pd.DataFrame(features.columns),pd.DataFrame(np.transpose(model.coef_))], axis = 1)
8.2 The other thing we can check is the intercept of the model. This is the predicted target (`SalePrice`) when all inputs are 0 - i.e. a house with a surface of 0 square feet, no bedrooms, no kitchens and an overall quality of 0:
# your code here
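A sketch of 8.1 and 8.2 together, again on invented data: `coef_` holds one coefficient per feature (in column order), `intercept_` is the baseline prediction when all features are 0.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up mini-dataset standing in for the real one
houses = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 2200, 1400, 2400],
    "BedroomAbvGr": [3, 4, 2, 3, 3, 4, 2, 5],
    "KitchenAbvGr": [1, 1, 1, 2, 1, 1, 1, 2],
    "OverallQual":  [5, 7, 4, 6, 5, 8, 4, 9],
    "SalePrice":    [208500, 305000, 140000, 235000, 210000, 345000, 160000, 390000],
})
features = houses[["GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "OverallQual"]]
target = houses["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_)       # one coefficient per feature, in column order
print(model.intercept_)  # predicted SalePrice when all features are 0
```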