You will now have the chance to practice everything we have seen in this chapter on a new dataset.
The dataset is called "Wine quality" and was downloaded from http://archive.ics.uci.edu/ml/datasets/Wine+Quality.
It contains the characteristics of 6497 wines along with their quality, a grade between 0 and 10 given by 3 wine experts. On the website the data is split between white and red wine, but we have grouped them together for this exercise. We added an additional attribute type, containing the type of wine (red or white).
To let you practice with missing values, some values have been deleted or modified from the original dataset.
# Import the functions for machine learning
%run 1-functions.ipynb
Import the dataset wine.csv into the variable data (2 points).
# import dataset
import pandas as pd
# Begin answer
data = pd.read_csv('wine.csv')
# End answer
### BEGIN HIDDEN TESTS
assert not data.empty
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert not data['fixed acidity'].empty
### END HIDDEN TESTS
Understand the dataset: what it looks like, the different attributes, their distribution. Plot the distribution of the attributes.
Place the first lines of the dataset in the variable head (1 point).
Place the statistical description of the dataset in the variable description (1 point).
How many different values does the attribute type contain? Place this number in the variable unique_type (1 point).
# examine dataset
# Begin answer
head = data.head()
description = data.describe()
unique_type = len(data['type'].unique())
# End answer
### BEGIN HIDDEN TESTS
assert head.equals(data.head())
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert description.equals(data.describe())
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert unique_type == len(data['type'].unique())
### END HIDDEN TESTS
Count the missing values in the attribute density and place this number in the variable missing_nb (1 point).
Think about how to deal with the missing values.
# detect missing values
# Begin answer
missing_nb = data['density'].isna().sum()
# End answer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 5235 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
type                    6497 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 659.9+ KB
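The info() output shows that density is the only column with missing entries (5235 non-null out of 6497). Missing counts can also be read off directly with isna().sum(); a minimal sketch on a toy frame (the column names and values are illustrative, not the real wine data):

```python
import numpy as np
import pandas as pd

# Toy frame with a gap in 'density', mirroring the situation above
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.0],
    'density': [0.998, np.nan, 0.995],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column['density'])  # 1
```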
### BEGIN HIDDEN TESTS
assert missing_nb == data['density'].isna().sum()
### END HIDDEN TESTS
The missing attribute density can be recovered from residual sugar and alcohol. Use this formula to recover the attribute: attribute = -0.0014 * alcohol + 0.0002 * residual sugar + 1.0082 (2 points).
# recover the missing values
# Begin answer
data.loc[data['density'].isna(), 'density'] = -0.0014 * data['alcohol'] + 0.0002 * data['residual sugar'] + 1.0082
# End answer
### BEGIN HIDDEN TESTS
assert data.loc[data['density'].isna()].empty
assert data.iloc[6]['density'] == -0.0014 * data.iloc[6]['alcohol'] + 0.0002 * data.iloc[6]['residual sugar'] + 1.0082
### END HIDDEN TESTS
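The recovery formula is a linear model of density in terms of alcohol and residual sugar; presumably it was obtained by fitting the rows where density is present. A sketch of how such coefficients could be re-derived (synthetic data standing in for the complete rows, not the real wine file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
alcohol = rng.uniform(8, 14, 500)
sugar = rng.uniform(0, 20, 500)
# Density generated from the formula given in the exercise, plus tiny noise
density = -0.0014 * alcohol + 0.0002 * sugar + 1.0082 + rng.normal(0, 1e-4, 500)

# Fitting the complete rows recovers coefficients close to the formula's
model = LinearRegression().fit(np.column_stack([alcohol, sugar]), density)
print(np.round(model.coef_, 4), round(model.intercept_, 4))
```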
Predict the attribute quality with a regression model (you can use the functions we used along the chapter).
Place the list of attributes used for the prediction in the variable x, and the attribute to predict in the variable y (2 points).
Place the mean absolute error of the predictions in the variable mae_regression (1 point).
# predict 'quality' with regression
from sklearn.metrics import mean_absolute_error
# Begin answer
x = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',
'density', 'pH', 'sulphates', 'alcohol']
y = ['quality']
predictions, ytest = knn_regression(data, x, y)
mae_regression = mean_absolute_error(predictions, ytest)
print('MAE: ' + str(mae_regression))
# End answer
MAE: 0.6105846153846155
### BEGIN HIDDEN TESTS
assert x == ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',
'density', 'pH', 'sulphates', 'alcohol']
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert y == ['quality']
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert mae_regression > 0.6 and mae_regression < 0.7
### END HIDDEN TESTS
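The helper knn_regression is defined in 1-functions.ipynb, which is not shown here. A plausible sketch of such a helper, assuming a KNeighborsRegressor and a simple train/test split (the name, signature, and parameters are guesses, demonstrated on synthetic data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def knn_regression_sketch(df, x_cols, y_cols, n_neighbors=5):
    """Train a k-NN regressor and return (predictions, ytest)."""
    xtrain, xtest, ytrain, ytest = train_test_split(
        df[x_cols], df[y_cols], test_size=0.2, random_state=0)
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(xtrain, ytrain)
    return model.predict(xtest), ytest

# Tiny synthetic demo: target is a simple function of the features
rng = np.random.default_rng(1)
demo = pd.DataFrame({'a': rng.random(100), 'b': rng.random(100)})
demo['target'] = demo['a'] + demo['b']
preds, ytest = knn_regression_sketch(demo, ['a', 'b'], ['target'])
print(preds.shape)  # (20, 1)
```

Because y_cols is a list, ytrain is a one-column DataFrame and the predictions come back as a column vector, which matches how the answer code above flattens predictions with element[0].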
# plot predictions vs. true labels
import matplotlib.pyplot as plt
import numpy as np
# Begin answer
plt.figure(figsize = (12, 8))
pred = [element[0] for element in predictions]  # flatten column-vector predictions
plt.plot(pred, ytest, 'x')
diagonal = np.linspace(3, 8, 10)  # avoid reusing x, which holds the attribute list
plt.plot(diagonal, diagonal, color = 'black')
plt.xlabel('Prediction')
plt.ylabel('True label')
plt.title('Prediction vs. true label')
# End answer
Find the combination of attributes that gives the worst prediction and place it in the variable worst_comb; it contains only one attribute (1 point).
Find the combination of attributes that gives the best prediction and place it in the variable best_comb; it contains 5 attributes (including fixed acidity and volatile acidity) (1 point).
# find best and worst attribute combinations
# Begin answer
worst_comb = ['sulphates']
best_comb = ['fixed acidity', 'volatile acidity', 'pH', 'sulphates', 'alcohol']
# End answer
### BEGIN HIDDEN TESTS
assert worst_comb == ['sulphates']
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert best_comb == ['fixed acidity', 'volatile acidity', 'pH', 'sulphates', 'alcohol']
### END HIDDEN TESTS
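One way to find such combinations is to score every candidate subset exhaustively. A sketch of that search with itertools.combinations and a k-NN regressor (synthetic data; the real exercise would score subsets of the wine attributes with knn_regression instead):

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((200, 3)), columns=['a', 'b', 'c'])
df['target'] = 2 * df['a'] + rng.normal(0, 0.05, 200)  # only 'a' is informative

def mae_for(cols):
    """MAE of a k-NN regressor trained on the given attribute subset."""
    xtr, xte, ytr, yte = train_test_split(
        df[list(cols)], df['target'], test_size=0.25, random_state=0)
    model = KNeighborsRegressor().fit(xtr, ytr)
    return mean_absolute_error(yte, model.predict(xte))

# Score every non-empty subset of the candidate attributes
scores = {cols: mae_for(cols)
          for r in range(1, 4)
          for cols in itertools.combinations(['a', 'b', 'c'], r)}
best = min(scores, key=scores.get)
worst = max(scores, key=scores.get)
print(best, worst)
```

With 10 wine attributes the number of subsets is 2^10 - 1 = 1023, which is still cheap enough to brute-force with a fast model like k-NN.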
Predict the attribute quality with a classification model, using all the attributes.
Place the mean absolute error of the predictions in the variable mae_classification (1 point).
Think about which model is the best to use: regression or classification? Why? Look at the values of the variable quality.
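To decide between regression and classification, it helps to look at the values quality actually takes: it is an integer grade, so value_counts() shows a handful of discrete levels rather than a continuum. A sketch on illustrative grades (not the real wine data):

```python
import pandas as pd

# Illustrative integer grades, as quality would contain
quality = pd.Series([5, 6, 5, 7, 6, 6, 4, 5])
counts = quality.value_counts().sort_index()
print(counts.index.tolist())  # [4, 5, 6, 7]
```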
# predict 'quality' with classification
from sklearn.metrics import mean_absolute_error
# Begin answer
x = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',
'density', 'pH', 'sulphates', 'alcohol']
y = ['quality']
predictions, ytest = knn_classification(data, x, y)
mae_classification = mean_absolute_error(predictions, ytest)
print('MAE: ' + str(mae_classification))
# End answer
MAE: 0.5884615384615385
C:\Users\Anna\Anaconda3\lib\site-packages\ipykernel_launcher.py:14: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
### BEGIN HIDDEN TESTS
assert mae_classification > 0.5 and mae_classification < 0.6
### END HIDDEN TESTS
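Like knn_regression, the helper knn_classification comes from 1-functions.ipynb. A plausible sketch assuming a KNeighborsClassifier, with y flattened via ravel() to avoid the DataConversionWarning seen in the output above (the name and parameters are guesses, demonstrated on synthetic data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_classification_sketch(df, x_cols, y_cols, n_neighbors=5):
    """Train a k-NN classifier and return (predictions, ytest)."""
    xtrain, xtest, ytrain, ytest = train_test_split(
        df[x_cols], df[y_cols], test_size=0.2, random_state=0)
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(xtrain, ytrain.values.ravel())  # ravel() silences the warning
    return model.predict(xtest), ytest

# Tiny synthetic demo with two discrete classes
rng = np.random.default_rng(3)
demo = pd.DataFrame({'a': rng.random(100), 'b': rng.random(100)})
demo['grade'] = (demo['a'] > 0.5).astype(int)
preds, ytest = knn_classification_sketch(demo, ['a', 'b'], ['grade'])
print(set(preds) <= {0, 1})  # True
```

Note that the classifier can only ever predict grades it has seen in training, whereas the regressor interpolates between them; the MAE values above show both are reasonable here because quality is an ordered integer scale.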
Split the dataset in two: one dataset for the white wines and one for the red wines. Predict the attribute quality with the new datasets and the chosen type of task (classification or regression).
Place the mean absolute errors of the predictions in the variables mae_1 and mae_2 (2 points).
# split the dataset and make predictions
from sklearn.metrics import mean_absolute_error
# Begin answer
x = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',
'density', 'pH', 'sulphates', 'alcohol']
y = ['quality']
white_wine = data.loc[data['type'] == 'white']
predictions, ytest = knn_classification(white_wine, x, y)
mae_1 = mean_absolute_error(predictions, ytest)
print('MAE 1: ' + str(mae_1))
red_wine = data.loc[data['type'] == 'red']
predictions, ytest = knn_classification(red_wine, x, y)
mae_2 = mean_absolute_error(predictions, ytest)
print('MAE 2: ' + str(mae_2))
# End answer
MAE 1: 0.6642857142857143
MAE 2: 0.49375
C:\Users\Anna\Anaconda3\lib\site-packages\ipykernel_launcher.py:14: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
### BEGIN HIDDEN TESTS
assert (mae_1 > 0.6 and mae_1 < 0.7) or (mae_2 > 0.6 and mae_2 < 0.7)
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
assert (mae_1 > 0.4 and mae_1 < 0.5) or (mae_2 > 0.4 and mae_2 < 0.5)
### END HIDDEN TESTS