import numpy as np
import matplotlib.pyplot as plt
Linear Regression
The goal of this week's exercise is to explore a simple linear regression problem based on Portuguese white wine.
The dataset is based on P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
# The code snippet below downloads the dataset to the Google Colab
# runtime. If you work with a local Anaconda setup, you can download
# the file directly using the link instead.
# The link was temporarily replaced because the UCI ML dataset archive seems to be down.
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
!wget https://raw.githubusercontent.com/zygmuntz/wine-quality/master/winequality/winequality-white.csv
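For a local setup without `wget`, the download can also be sketched in plain Python. This is only a minimal sketch: `download_if_missing` is a hypothetical helper name, not part of the exercise, and the URL is the replacement link from the cell above.

```python
import os
import urllib.request

# Replacement link taken from the download cell above.
URL = ("https://raw.githubusercontent.com/zygmuntz/wine-quality/"
       "master/winequality/winequality-white.csv")


def download_if_missing(url, filename):
    """Hypothetical helper: fetch url into filename unless it already exists."""
    if os.path.exists(filename):
        return False  # file is already there, nothing to download
    urllib.request.urlretrieve(url, filename)
    return True


# Uncomment to fetch the dataset into the current working directory:
# download_if_missing(URL, "winequality-white.csv")
```

Skipping the download when the file already exists avoids hitting the server again every time the notebook is re-run.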
Before we start
The downloaded file contains data on 4898 wines. For each wine, 11 features are recorded (columns 0 to 10). The final column contains the quality of the wine. This is what we want to predict.
List of columns/features: 0. fixed acidity
# load all examples from the file
data = np.genfromtxt('winequality-white.csv',delimiter=";",skip_header=1)
print("data:", data.shape)
# Prepare for proper training
np.random.shuffle(data) # randomly sort examples
# take the first 3000 examples for training
# (remember array slicing from last week)
X_train = data[:3000,:11] # all features except last column
y_train = data[:3000,11] # quality column
# and the remaining examples for testing
X_test = data[3000:,:11] # all features except last column
y_test = data[3000:,11] # quality column
print("First example:")
print("Features:", X_train[0])
print("Quality:", y_train[0])
('data:', (4898, 12)) First example: ('Features:', array([7.600e+00, 3.800e-01, 2.800e-01, 4.200e+00, 2.900e-02, 7.000e+00, 1.120e+02, 9.906e-01, 3.000e+00, 4.100e-01, 1.260e+01])) ('Quality:', 6.0)
Plot (plt.hist) the distribution of each of the features for the training data as well as the 2D distribution (either plt.scatter or plt.hist2d) of each feature versus quality. Also calculate the correlation coefficient (np.corrcoef) for each feature with quality. Which feature by itself seems most predictive for the quality?
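One possible way to organise this exploration is sketched below. To keep the block self-contained it runs on made-up data (`X_demo`, `y_demo` are assumptions, not the wine data); for the exercise, `X_train` and `y_train` would take their place. Ranking features by the absolute correlation coefficient is one reasonable reading of "most predictive", not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption): 200 examples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# The target depends strongly on feature 2 and weakly on feature 0.
y_demo = 0.2 * X_demo[:, 0] + 1.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

n_features = X_demo.shape[1]
fig, axes = plt.subplots(2, n_features, figsize=(4 * n_features, 6))
correlations = np.empty(n_features)

for i in range(n_features):
    # Top row: 1D distribution of the feature.
    axes[0, i].hist(X_demo[:, i], bins=30)
    axes[0, i].set_xlabel(f"feature {i}")
    # Bottom row: feature versus target.
    axes[1, i].scatter(X_demo[:, i], y_demo, s=5)
    axes[1, i].set_xlabel(f"feature {i}")
    axes[1, i].set_ylabel("target")
    # np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient
    # between the two inputs.
    correlations[i] = np.corrcoef(X_demo[:, i], y_demo)[0, 1]

fig.tight_layout()

best = np.argmax(np.abs(correlations))
print("correlations:", correlations)
print("most predictive feature (by |r|):", best)
```

In a notebook the figure is rendered at the end of the cell; in a plain script a call to `plt.show()` would be needed.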
Calculate the linear regression weights as derived in the lecture. Numpy provides functions for matrix multiplication (np.matmul), matrix transposition (.T) and matrix inversion (np.linalg.inv).
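As a rough sketch, and assuming the lecture's result is the standard least-squares normal equation $w = (X^T X)^{-1} X^T y$, the three NumPy operations combine as follows. The function name `fit_linear_regression` and the prepended column of ones for the intercept are assumptions; the lecture may handle the bias term differently. The data is made up so that the expected weights are known.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Least-squares weights via the normal equation w = (X^T X)^{-1} X^T y.

    A column of ones is prepended so that w[0] acts as the intercept
    (an assumption; the lecture's convention may differ).
    """
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.matmul(np.linalg.inv(np.matmul(X_b.T, X_b)), np.matmul(X_b.T, y))


# Synthetic check (made-up data): y = 1 + 2*x0 - 3*x1, no noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

w = fit_linear_regression(X_demo, y_demo)
print(w)  # approximately [1, 2, -3]
```

Explicit inversion mirrors the formula, which is the point of the exercise. For ill-conditioned feature matrices, `np.linalg.solve` or `np.linalg.lstsq` are usually the numerically safer choice.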
Use the weights to predict the quality for the test dataset. How does your predicted quality compare with the true quality of the test data? Calculate the correlation coefficient between predicted and true quality and draw the scatter plot.
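The evaluation step could look roughly like the sketch below. It again uses made-up data and a hypothetical `add_bias` helper so that it runs on its own; in the exercise, `X_train`, `y_train`, `X_test` and `y_test` would be used instead. The root-mean-square error is an extra metric not asked for in the task, included only as a second point of comparison.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 300 examples, 2 features, linear target plus noise.
X = rng.normal(size=(300, 2))
y = 0.5 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.3, size=300)
X_tr, y_tr = X[:200], y[:200]
X_te, y_te = X[200:], y[200:]


def add_bias(X):
    """Hypothetical helper: prepend a column of ones for the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


# Fit on the training part with the normal equation.
A = add_bias(X_tr)
w = np.matmul(np.linalg.inv(np.matmul(A.T, A)), np.matmul(A.T, y_tr))

# Predict on the held-out part and compare with the true values.
y_pred = np.matmul(add_bias(X_te), w)
r = np.corrcoef(y_pred, y_te)[0, 1]
rmse = np.sqrt(np.mean((y_pred - y_te) ** 2))
print("correlation:", r, "RMSE:", rmse)
```

A scatter plot of the two arrays, for example `plt.scatter(y_te, y_pred)`, then shows how closely the predictions follow the diagonal.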
x = np.random.uniform(size=(3,4))
x
array([[0.27061972, 0.85093187, 0.06038869, 0.6430975 ], [0.05802941, 0.1492127 , 0.93073299, 0.70555297], [0.4806267 , 0.27201085, 0.75607278, 0.88637951]])
x[1,1]
0.14921269768865764
f = x[1:,2:]
print(f)
[[0.93073299 0.70555297] [0.75607278 0.88637951]]
f[0,0] = 999
print(f)
[[9.99000000e+02 7.05552973e-01] [7.56072781e-01 8.86379512e-01]]
x
array([[2.70619720e-01, 8.50931871e-01, 6.03886907e-02, 6.43097505e-01], [5.80294054e-02, 1.49212698e-01, 9.99000000e+02, 7.05552973e-01], [4.80626701e-01, 2.72010854e-01, 7.56072781e-01, 8.86379512e-01]])
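The session above shows that assigning to the slice `f` also changed `x`: basic slicing returns a view on the same memory, not a copy. A minimal sketch of the difference, using a small made-up array:

```python
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)

view = x[1:, 2:]         # basic slicing: shares memory with x
copy = x[1:, 2:].copy()  # explicit copy: independent of x

copy[0, 0] = -1.0        # does not touch x
view[0, 0] = 999.0       # writes through to x[1, 2]

print(x[1, 2])                                               # 999.0
print(np.shares_memory(x, view), np.shares_memory(x, copy))  # True False
```

This matters for the train/test split as well: `X_train` and `X_test` are views into `data`, so modifying them in place (for example when normalising features) also modifies `data` unless `.copy()` is used.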