Exercise 3.2

In [1]:
import numpy as np
import matplotlib.pyplot as plt

Linear Regression

The goal of this exercise is to explore a simple linear regression problem based on Portuguese white wine.

The dataset is based on P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Published in Decision Support Systems, Elsevier, 47(4):547-553, 2009.

In [3]:
# The code snippet below is responsible for downloading the dataset
# - for example when running via Google Colab.
#
# You can also directly download the file using the link if you work
# with a local setup (in that case, ignore the !wget)

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
/bin/sh: wget: command not found
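
If wget is not available (as in the output above), the file can also be fetched from Python itself. The following is a minimal sketch, assuming Python 3 and network access; download_if_missing is a hypothetical helper name, not part of the exercise, and it only uses the standard library.

```python
import os
import urllib.request

# Same URL as in the wget command above
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")


def download_if_missing(url=URL, filename="winequality-white.csv"):
    """Download url to filename unless the file already exists.

    Hypothetical helper for illustration only.
    """
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename
```

Calling download_if_missing() once before np.genfromtxt should leave winequality-white.csv in the working directory; if the file is already there, nothing is downloaded.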

Before we start

The downloaded file contains data on 4898 wines. For each wine, 11 features are recorded (columns 0 to 10). The final column (column 11) contains the quality of the wine. This is what we want to predict. More information on the features and the quality measurement is provided in the original publication.

List of columns/features:

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol
  12. quality
In [ ]:
# Before working with the data,
# we load the file and prepare all features

# load all examples from the file
data = np.genfromtxt('winequality-white.csv',delimiter=";",skip_header=1)

print("data:", data.shape)

# Prepare for proper training
np.random.shuffle(data) # randomly sort examples

# take the first 3000 examples for training
# (remember array slicing from last week)
X_train = data[:3000,:11] # all features except last column
y_train = data[:3000,11]  # quality column

# and the remaining examples for testing
X_test = data[3000:,:11] # all features except last column
y_test = data[3000:,11] # quality column

print("First example:")
print("Features:", X_train[0])
print("Quality:", y_train[0])
('data:', (4898, 12))
First example:
('Features:', array([7.600e+00, 3.800e-01, 2.800e-01, 4.200e+00, 2.900e-02, 7.000e+00,
       1.120e+02, 9.906e-01, 3.000e+00, 4.100e-01, 1.260e+01]))
('Quality:', 6.0)
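
Because np.random.shuffle is called without a fixed seed, the first example and all later results change from run to run. A reproducible variant of the split could look like the sketch below. It assumes NumPy 1.17 or newer (for np.random.default_rng); shuffle_split and the small demo array are hypothetical and only stand in for the wine data.

```python
import numpy as np


def shuffle_split(data, n_train, seed=0):
    """Shuffle rows with a fixed seed, then split into train and test parts.

    The last column is treated as the target, all others as features.
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(data)  # shuffled copy; rows stay intact
    X_train, y_train = shuffled[:n_train, :-1], shuffled[:n_train, -1]
    X_test, y_test = shuffled[n_train:, :-1], shuffled[n_train:, -1]
    return X_train, y_train, X_test, y_test


# Stand-in for the wine data: 5 rows, 11 feature columns plus 1 target column
demo = np.arange(60, dtype=float).reshape(5, 12)
X_tr, y_tr, X_te, y_te = shuffle_split(demo, n_train=3)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

With the real data, shuffle_split(data, n_train=3000) would reproduce the 3000 / 1898 split used above, with the same rows selected on every run for a given seed.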

Problems

  • First we want to understand the data better. Plot (plt.hist) the distribution of each of the features for the training data as well as the 2D distribution (either plt.scatter or plt.hist2d) of each feature versus quality. Also calculate the correlation coefficient (np.corrcoef) for each feature with quality. Which feature by itself seems most predictive for the quality?

  • Calculate the linear regression weights. Numpy provides functions for matrix multiplication (np.matmul), matrix transposition (.T) and matrix inversion (np.linalg.inv).

  • Use the weights to predict the quality for the test dataset. How does your predicted quality compare with the true quality of the test data? Calculate the correlation coefficient between predicted and true quality and draw a scatter plot.
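
As a starting point for the first problem, the sketch below shows how np.corrcoef can be applied column by column. It runs on synthetic data rather than the wine data, so the numbers say nothing about the wines; feature_correlations, X_demo and y_demo are hypothetical names used only for illustration.

```python
import numpy as np


def feature_correlations(X, y):
    """Pearson correlation coefficient of each column of X with y."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
    return np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])


# Synthetic example: the target depends only on column 1, plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=500)

corr = feature_correlations(X_demo, y_demo)
best = int(np.argmax(np.abs(corr)))  # largest absolute correlation
print("correlations:", corr)
print("most predictive column:", best)
```

The absolute value matters here: a feature with a strongly negative coefficient is just as informative for a linear model as one with a strongly positive coefficient.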

Hints

Formally, we want to find weights $w_i$ that minimize:

$$ \sum_{j}\left(\sum_{i} X_{j i} w_{i}-y_{j}\right)^{2} $$

The index $i$ denotes the different features (properties of the wines) while the index $j$ runs over the different wines. The matrix $X_{ji}$ contains the training data, with one row per wine and one column per feature (the same layout as X_train), and $y_j$ is the 'true' quality for sample $j$. The weights can be found by taking the first derivative of the above expression with respect to the weights and setting it to zero (the standard strategy for finding an extremum), and solving the corresponding system of equations (for a detailed derivation, see here). The result is:

$$ \overrightarrow{\mathbf{w}}=\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \overrightarrow{\mathbf{y}} $$

In the end, you should have as many components of $w_i$ as there are features in the data (i.e. eleven in this case).
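
The closed-form expression for the weights can be checked on synthetic data before applying it to the wines. The sketch below is one possible way to do this, using only the NumPy functions named in the problems; fit_linear_weights, X_demo, w_true and y_demo are hypothetical names, and np.linalg.lstsq is used purely as an independent cross-check.

```python
import numpy as np


def fit_linear_weights(X, y):
    """Least-squares weights w = (X^T X)^{-1} X^T y, for X of shape (samples, features)."""
    XtX = np.matmul(X.T, X)   # shape (features, features)
    Xty = np.matmul(X.T, y)   # shape (features,)
    return np.matmul(np.linalg.inv(XtX), Xty)


# Synthetic data with known weights and a little noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
w_true = np.array([0.5, -1.0, 2.0, 0.0])
y_demo = np.matmul(X_demo, w_true) + 0.01 * rng.normal(size=200)

w = fit_linear_weights(X_demo, y_demo)
w_ref = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]  # independent cross-check
y_pred = np.matmul(X_demo, w)  # predictions: one value per sample

print("X:", X_demo.shape, "w:", w.shape, "y_pred:", y_pred.shape)
print("fitted weights:", w)
```

On the wine data the same function would be called as fit_linear_weights(X_train, y_train), and predictions for the test set follow from np.matmul(X_test, w). Note that the model as written in the hints has no constant offset term.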

You can use .shape to inspect the dimensions of NumPy arrays.