import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.offline as py
import warnings
import pycountry
from statsmodels.graphics.gofplots import qqplot
from wordcloud import WordCloud, STOPWORDS
warnings.filterwarnings('ignore')
import re
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix, hstack
from yellowbrick.model_selection import ValidationCurve, LearningCurve
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
RANDOM_SEED=17
The data is taken from the Kaggle dataset https://www.kaggle.com/zynicide/wine-reviews/home, which in turn was scraped by the dataset's author from https://www.winemag.com/
It contains many reviews from different experts for wines from all over the world, along with some wine-specific information for each entry.
Dataset consists of the following fields (per info from https://github.com/zackthoutt/wine-deep-learning):
Our target is the wine rating (Points). Reviewers on the original site rate wines from 80 to 100; here are the details of the different ranges:
Range | Mark | Description |
---|---|---|
98–100 | Classic | The pinnacle of quality |
94–97 | Superb | A great achievement |
90–93 | Excellent | Highly recommended |
87–89 | Very Good | Often good value; well recommended |
83–86 | Good | Suitable for everyday consumption; often good value |
80–82 | Acceptable | Can be employed in casual, less-critical circumstances |
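This scale maps directly onto the points; a minimal sketch using `pd.cut` (the bin edges follow the table above, while the sample values and variable names are my own):

```python
import pandas as pd

# Bin edges follow the rating scale above (right edge inclusive);
# the sample points are invented for illustration
bins = [79, 82, 86, 89, 93, 97, 100]
marks = ['Acceptable', 'Good', 'Very Good', 'Excellent', 'Superb', 'Classic']

points = pd.Series([81, 85, 88, 91, 95, 99])
labels = pd.cut(points, bins=bins, labels=marks)
print(labels.tolist())
```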
Originally, the dataset author collected the data to create a predictive model that could identify wines through blind tasting, like a master sommelier would.
Here we will try to solve a simpler task, yet one useful in real life: predict the wine rating based on the wine's features and the words used in its review. This can have the following practical applications:
Unlike other beverages, wine comes in overwhelming variety: there are about 10,000 grape varieties (and the number is growing), they can be blended in different proportions, the harvest year and growing conditions come into play, the wine may be aged for different amounts of time in different types of barrels, and so on.
So a review of a specific wine, or a list like "top 10 wines of the season", isn't very useful: if you visit two different local stores, there is a good chance you won't find the same wine in both of them. Finding a specific wine may require a journey to another city or even country :) Under such conditions it's worth having a model that can predict wine quality without an exact expert rating, based only on the features you can read off the bottle.
While taste is a purely personal matter, professionals always try to free themselves from biases and provide objective observations. Blind testing may help reveal the biases of a specific reviewer.
In fact, the model could be used to cross-validate the expert ratings :)
Let's download the data from Kaggle, extract it into the data folder and check the main properties of the resulting DataFrame:
df = pd.read_csv('data/winemag-data-130k-v2.csv', index_col=0)
df.info(memory_usage='deep')
As we can see, there are many null values in the data; we will need to deal with them later.
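A quick way to rank columns by missing values, sketched on a toy frame (the real counts depend on the dataset version, so the data here is invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real df
toy = pd.DataFrame({
    'country': ['US', None, 'France'],
    'price': [12.0, np.nan, np.nan],
    'points': [88, 90, 85],
})
# NaN count per column, most-affected columns first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```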
df.head()
Let's check the data for possible categorical features:
df.nunique()
Looks like the following features can be represented as categorical:
Let's explore the data now to get better acquainted with the dataset:
plt.figure(figsize=(13, 10))
ax = sns.countplot(y=df.country, order=df.country.value_counts().index, palette='tab10')
for p, label in zip(ax.patches, df.country.value_counts()):
    ax.annotate("{0:,d}".format(label), (p.get_width() + 50, p.get_y() + 0.7))
ax.set_title('Number of wine reviews per country', fontsize=18);
We see that most reviews are for wines from the US, which can be explained by the fact that the reviewers are mostly located there.
It should also be noted that some countries have very few reviews, which may cause problems.
Let's see how the countries are distributed on the map, along with the number of reviews in each.
For the choropleth to display the coloring in a more understandable way, let's log1p-transform the number of reviews per country:
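As a toy illustration of why (the counts below are invented): log1p compresses values spanning several orders of magnitude into a narrow range, while still mapping 0 to 0:

```python
import numpy as np

# log1p(x) = log(1 + x): safe for zero counts, squeezes the range
counts = np.array([0, 9, 99, 999, 9999])
print(np.log1p(counts).round(2))
```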
countries = df.groupby('country').size().reset_index()
countries.columns = ['name', 'size']
countries.name = countries.name.replace({ # making the country names compatible with pycountry
'England': 'United Kingdom',
'Czech Republic': 'Czechia',
'Macedonia': 'Macedonia, Republic of',
'Moldova': 'Moldova, Republic of',
'US': 'United States'
})
data = pd.DataFrame(index=countries.index)
data['name'] = countries.name
data['size'] = countries['size']
data['code'] = countries.apply(lambda x: pycountry.countries.get(name=x['name']), axis=1)
data['code'] = data.code.apply(lambda x: x.alpha_3 if x else None)
data = data.dropna()
choropleth_data = [dict(
type='choropleth',
locations=data['code'],
z=np.log1p(data['size']),
#showscale=False,
text=data['name'],
marker=dict(
line=dict(
color='rgb(180,180,180)',
width=0.5
)),
)]
layout = dict(
title='Number of wine reviews per country, log-transformed',
geo=dict(
showframe=False,
showcoastlines=True,
projection=dict(
type='natural earth'
)
))
fig = dict(data=choropleth_data, layout=layout)
py.iplot(fig, validate=False)
top_rated_countries = df[['country', 'points']].groupby('country').mean().reset_index().sort_values('points', ascending=False).country[:10]
data = df[df.country.isin(top_rated_countries)]
plt.figure(figsize=(15, 7))
ax = sns.violinplot(x='country', y='points', data=data, order=top_rated_countries, palette='tab10')
ax.set_title('Top 10 countries with highest average rating', fontsize=18);
Here we can see that some of the countries with a low number of reviews have a pretty high average rating.
Probably this is because wines with the highest potential rating are the first to be reviewed by the experts.
The dependency between Country and Points is clear.
Countries with few reviews do not have much predictive power and introduce unnecessary noise, so let's replace them with the name 'Other' instead:
vc = df.country.value_counts()
df['trans_country'] = df.country.replace(vc[vc < 100].index, 'Other')
top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]
top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]
plt.figure(figsize=(15, 7))
ax = sns.violinplot(x='trans_country', y='points', data=top_rated_countries_data, order=top_rated_countries, palette='tab10')
ax.set_title('Top 10 countries with highest average rating', fontsize=18);
Now we see a better distribution of the rating among the countries in the top 10 list.
These features are actually parts of the wine location hierarchy, so they are better joined into one field together with the Country. Let's take a look at them:
df[['trans_country', 'province', 'region_1', 'region_2']].head()
df[['trans_country', 'province', 'region_1', 'region_2']].nunique()
print('Countries with Region 2:', df[~df.region_2.isna()].trans_country.unique())
Looks like Region 2 is a US-specific feature, but it won't hurt to include it as well, so we get better categorization for US wines.
df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)
df.location.head()
Now let's try to see if there is a dependency between the Points and Location:
df_top_locations = df[df.location.isin(df.location.value_counts().index[:10])]
plt.figure(figsize=(12, 10))
ax = sns.violinplot(y='location', x='points', data=df_top_locations, palette='tab10');
ax.set_title('Wine rating distribution over top 10 locations with highest average rating', fontsize=18);
Let's see if we can get something from the title:
df[['region_1', 'title']].head()
As we can see, some regions are repeated in the title, so even when the region is NaN, it is possible to fill it with the value from the title. Let's do that:
def extract_region_1(row):
    # keep the existing value when region_1 is already filled
    if not pd.isna(row.region_1):
        return row.region_1
    # otherwise take the region from the trailing "(...)" in the title, if any
    if not row.title.endswith(')'):
        return None
    return row.title[row.title.rindex('(')+1:-1]
df.region_1 = df.apply(extract_region_1, axis=1)
df[['region_1', 'title']].head()
Great, now let's recreate the Location:
df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)
df.location.head()
Now let's replace the locations with a low number of reviews with the name 'Other':
vc = df.location.value_counts()
df.location = df.location.replace(vc[vc < 2].index, 'Other')
Price is given in US$; let's see how it's distributed:
plt.figure(figsize=(15, 5))
data = df[~df.price.isna()]
plt.scatter(range(data.shape[0]), np.sort(data.price.values)[::-1])
plt.title("Distribution of wine prices", fontsize=18)
plt.ylabel('Price');
Wow, there are wines priced at more than $3,000. That's not your usual weekend wine :)
As we can see, the price distribution is very skewed, so let's try to log-transform it:
plt.figure(figsize=(15, 3))
series_price = df[~df.price.isna()].price.apply(np.log1p)
ax = sns.distplot(series_price);
ax.set_title("Distribution of wine prices", fontsize=18)
ax.set_ylabel('Price (log1p)')
ax.set_xlabel('');
Still, it's not normal:
print('Shapiro-Wilk test:', stats.shapiro(series_price))
print('Kolmogorov-Smirnov test:', stats.kstest(series_price, cdf='norm'))
But it's not very skewed anymore:
print('Skewness:', series_price.skew())
print('Kurtosis:', series_price.kurt())
Now let's see a connection between the Price (not log-transformed) and Points:
plt.figure(figsize=(15, 5))
ax = sns.regplot(x='points', y='price', data=df, fit_reg=False, x_jitter=True)
ax.set_title('Correlation between the wine price and points given', fontsize=18);
And now let's see which countries have the most expensive wines (on average):
plt.figure(figsize=(13, 7))
data = df[['country', 'price']].groupby('country').mean().reset_index().sort_values('price', ascending=False)
ax = sns.barplot(y='country', x='price', data=data, palette='tab10')
for p, label in zip(ax.patches, data.price):
    if np.isnan(label):
        continue
    ax.annotate('{0:.2f}'.format(label), (p.get_width() + 0.2, p.get_y() + 0.5))
ax.set_title('Top countries with the most expensive average wine prices');
Interestingly, we see, for example, Germany, Hungary and France among the leaders here; they are also among the leaders for average wine rating above.
Let's take the countries with the top rated wines and see the price distribution in them:
plt.figure(figsize=(15, 5))
sns.violinplot(x='country', y='price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');
Well, that's not good; the Price needs to be transformed:
df['trans_price'] = df.price.apply(np.log1p)
top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]
top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]
plt.figure(figsize=(15, 5))
sns.violinplot(x='trans_country', y='trans_price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');
plt.figure(figsize=(15, 5))
ax = sns.regplot(x='points', y='trans_price', data=df, fit_reg=False, x_jitter=True)
ax.set_title('Correlation between the wine price (log) and points given', fontsize=18);
Let's see the top 10 varieties with their wine counts:
df_top_varieties = df[df.variety.isin(df.variety.value_counts().index[:10])]
plt.figure(figsize=(13, 5))
ax = sns.countplot(y=df_top_varieties.variety, order=df_top_varieties.variety.value_counts().index, palette='tab10')
for p, label in zip(ax.patches, df_top_varieties.variety.value_counts()):
    ax.annotate("{0:,d}".format(label), (p.get_width() + 50, p.get_y() + 0.5))
ax.set_title('Number of wines per variety', fontsize=18);
Now let's see the dependency between the Variety and Points:
plt.figure(figsize=(14, 10))
ax = sns.violinplot(y='variety', x='points', data=df_top_varieties, palette='tab10', order=df_top_varieties.variety.value_counts().index)
ax.set_title('Wine rating distribution over top 10 varieties by wine count', fontsize=18);
As we can see, some varieties get higher points than others, and the shape of the points distribution may also vary.
Variety has the same problem as the other categorical features: some varieties have almost no samples, yet they affect the points heavily:
top_rated_varietes = df[['variety', 'points']].groupby('variety').mean().reset_index().sort_values('points', ascending=False).variety[:10]
top_rated_varietes_data = df[df.variety.isin(top_rated_varietes)]
plt.figure(figsize=(15, 5))
ax = sns.violinplot(x='variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
vc = df.variety.value_counts()
df['trans_variety'] = df.variety.replace(vc[vc < 2].index, 'Other')
top_rated_varietes = df[['trans_variety', 'points']].groupby('trans_variety').mean().reset_index().sort_values('points', ascending=False).trans_variety[:10]
top_rated_varietes_data = df[df.trans_variety.isin(top_rated_varietes)]
plt.figure(figsize=(15, 5))
ax = sns.violinplot(x='trans_variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
df.title.head(10)
The title itself doesn't seem to contain valuable information, except that we have already used it to fill the nulls in Region 1, and we can extract a Year (vintage) from it; we will do that later.
The description is a typical textual variable, which we can try to analyze with word clouds.
Let's see what the experts say about wines with low ratings:
stopwords = set(STOPWORDS)
stopwords.update(['wine', 'a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'f', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'first', 'for', 'four', 'from', 'full', 'fully', 'further', 'furthered', 'furthering', 'furthers', 'g', 'gave', 'general', 'generally', 'get', 'gets', 'give', 'given', 'gives', 'go', 'going', 'good', 'goods', 'got', 'great', 'greater', 'greatest', 'group', 'grouped', 'grouping', 'groups', 'h', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'herself', 'high', 'high', 'high', 'higher', 'highest', 'him', 'himself', 'his', 'how', 'however', 'i', 'if', 'important', 'in', 'interest', 'interested', 'interesting', 'interests', 'into', 'is', 'it', 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kind', 'knew', 'know', 'known', 'knows', 'l', 'large', 'largely', 'last', 'later', 'latest', 'least', 'less', 'let', 'lets', 'like', 'likely', 'long', 'longer', 'longest', 'm', 'made', 'make', 'making', 'man', 'many', 'may', 'me', 'member', 'members', 'men', 'might', 'more', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', 'my', 'myself', 'n', 'necessary', 'need', 'needed', 
'needing', 'needs', 'never', 'new', 'new', 'newer', 'newest', 'next', 'no', 'nobody', 'non', 'noone', 'not', 'nothing', 'now', 'nowhere', 'number', 'numbers', 'o', 'of', 'off', 'often', 'old', 'older', 'oldest', 'on', 'once', 'one', 'only', 'open', 'opened', 'opening', 'opens', 'or', 'order', 'ordered', 'ordering', 'orders', 'other', 'others', 'our', 'out', 'over', 'p', 'part', 'parted', 'parting', 'parts', 'per', 'perhaps', 'place', 'places', 'point', 'pointed', 'pointing', 'points', 'possible', 'present', 'presented', 'presenting', 'presents', 'problem', 'problems', 'put', 'puts', 'q', 'quite', 'r', 'rather', 'really', 'right', 'right', 'room', 'rooms', 's', 'said', 'same', 'saw', 'say', 'says', 'second', 'seconds', 'see', 'seem', 'seemed', 'seeming', 'seems', 'sees', 'several', 'shall', 'she', 'should', 'show', 'showed', 'showing', 'shows', 'side', 'sides', 'since', 'small', 'smaller', 'smallest', 'so', 'some', 'somebody', 'someone', 'something', 'somewhere', 'state', 'states', 'still', 'still', 'such', 'sure', 't', 'take', 'taken', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'therefore', 'these', 'they', 'thing', 'things', 'think', 'thinks', 'this', 'those', 'though', 'thought', 'thoughts', 'three', 'through', 'thus', 'to', 'today', 'together', 'too', 'took', 'toward', 'turn', 'turned', 'turning', 'turns', 'two', 'u', 'under', 'until', 'up', 'upon', 'us', 'use', 'used', 'uses', 'v', 'very', 'w', 'want', 'wanted', 'wanting', 'wants', 'was', 'way', 'ways', 'we', 'well', 'wells', 'went', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 'who', 'whole', 'whose', 'why', 'will', 'with', 'within', 'without', 'work', 'worked', 'working', 'works', 'would', 'x', 'y', 'year', 'years', 'yet', 'you', 'young', 'younger', 'youngest', 'your', 'yours', 'z'])
wordcloud = WordCloud(background_color='white', stopwords=stopwords,
max_words=500, max_font_size=200, width=2000, height=800,
random_state=RANDOM_SEED).generate(' '.join(df[df.points < 83].description.str.lower()))
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud)
plt.title("Low Rated Wines Description Word Cloud", fontsize=20)
plt.axis('off');
bitter, sour, simple, sharp, tart - there must definitely be something wrong with these wines!
wordcloud = WordCloud(background_color='white', stopwords=stopwords,
max_words=500, max_font_size=200, width=2000, height=800,
random_state=RANDOM_SEED).generate(' '.join(df[df.points > 97].description.str.lower()))
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud)
plt.title("High Rated Wines Description Word Cloud", fontsize=20)
plt.axis('off');
Oh yeah, much better: structured, complex, classic, rich, ripe, powerful, intense and other good words which you would expect for pricey, highly rated wines :)
We don't need these fields for our goals, since we will not have them when making predictions with the model.
Let's skip these features for now and see if we can use them later.
Let's see how our target is distributed:
plt.figure(figsize=(15, 5))
sns.distplot(df.points, kde=False);
Well, it looks like we have a binomial-like distribution here, and while it may look normally distributed, the tests don't confirm that:
print('Shapiro-Wilk test:', stats.shapiro(df.points))
print('Kolmogorov-Smirnov test:', stats.kstest(df.points, cdf='norm'))
Skewness is pretty low, however:
print('Skewness:', df.points.skew())
print('Kurtosis:', df.points.kurt())
Here is the QQ-plot, in addition:
plt.rcParams['figure.figsize'] = (7, 7)
qqplot(df.points, line='r');
The problem here is that our Points take discrete values instead of continuous ones, which might suggest treating this problem as classification or ordinal regression.
Still, simple regression should also work well in our case, even though the data is discrete.
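If we ever wanted to respect the discreteness at prediction time, one cheap post-processing option (a sketch, not used below; the helper name is my own) is to round and clip the regression output back onto the 80-100 scale:

```python
import numpy as np

def to_points(raw_predictions):
    # Snap continuous regression output onto the discrete 80..100 scale
    return np.clip(np.rint(raw_predictions), 80, 100).astype(int)

print(to_points(np.array([79.2, 86.6, 101.3])))
```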
There are two popular metrics we can choose from: MAE (mean absolute error) and MSE (mean squared error).
MSE is the better choice for this problem because:
In scikit-learn MSE is represented in negated form under the following name; let's save it:
SCORING = 'neg_mean_squared_error'
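To illustrate the difference on invented numbers: a set of predictions each off by 1 and a set with a single error of 4 have the same MAE, but MSE grows quadratically with the large error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([88, 90, 85, 92])
y_small = np.array([89, 89, 86, 91])    # every prediction off by 1
y_outlier = np.array([88, 90, 85, 96])  # one prediction off by 4

print(mean_absolute_error(y_true, y_small), mean_squared_error(y_true, y_small))
print(mean_absolute_error(y_true, y_outlier), mean_squared_error(y_true, y_outlier))
```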
We have the following meaningful properties of our task:
Given these properties, we can use either SGD or Ridge.
While SGD would be much faster than Ridge on this task, it would not give us the same level of accuracy and requires much more tuning than Ridge, which essentially has a single hyperparameter.
So let's use Ridge and see how it will perform:
MODEL = Ridge(random_state=RANDOM_SEED)
Since our data does not have any heavy specifics, we can choose simple KFold cross-validation with 10 folds and shuffling:
CV = KFold(n_splits=10, shuffle=True, random_state=RANDOM_SEED)
df_full = df.copy(deep=True)
df_full = df_full.drop(['country', 'price', 'taster_name', 'taster_twitter_handle', 'variety', 'province', 'region_1', 'region_2'], axis=1)
df_full.columns = [x.replace('trans_', '') for x in df_full.columns]
df_full.info()
# Fill missing countries with "Other"
df_full.country = df_full.country.fillna('Other')
# Fill missing locations with "Other"
df_full.location = df_full.location.fillna('Other')
# Remove samples with missing prices, since there are not many of them and it's an important feature
df_full = df_full[~df_full.price.isna()]
df_full.info()
df_train, df_test, y_train, y_test = train_test_split(df_full.drop(['points'], axis=1), df_full.points, test_size=0.25, random_state=RANDOM_SEED)
df_train.shape, df_test.shape, y_train.shape, y_test.shape
Note that we will process the train and test sets separately, so as not to introduce a "looking into the future" problem.
categorical_features = ['country', 'variety', 'location']
for feature in categorical_features:
    # Build the dtype from the train set only, then apply it to both splits,
    # so train and test produce identical dummy columns
    cat_dtype = pd.CategoricalDtype(categories=df_train[feature].dropna().unique())
    df_train[feature] = df_train[feature].astype(cat_dtype)
    df_test[feature] = df_test[feature].astype(cat_dtype)
X_train_cat = pd.get_dummies(df_train[categorical_features], sparse=True)
X_test_cat = pd.get_dummies(df_test[categorical_features], sparse=True)
X_train_cat.shape, X_test_cat.shape
tv = TfidfVectorizer(stop_words=stopwords, max_features=10000)
X_train_desc = tv.fit_transform(df_train.description)
X_test_desc = tv.transform(df_test.description)
X_train_desc.shape, X_test_desc.shape
Our model is sensitive to non-centered numeric features, so we need to scale them:
ss = StandardScaler()
df_train.price = ss.fit_transform(df_train[['price']])
df_test.price = ss.transform(df_test[['price']])
X_train = csr_matrix(hstack([
df_train[['price']],
X_train_cat,
X_train_desc,
]))
X_test = csr_matrix(hstack([
df_test[['price']],
X_test_cat,
X_test_desc,
]))
X_train.shape, X_test.shape
def prepare_data(df_full, categorical_features):
    df_train, df_test, y_train, y_test = train_test_split(df_full.drop(['points'], axis=1), df_full.points, test_size=0.25, random_state=RANDOM_SEED)
    print('processing categorical features')
    for feature in categorical_features:
        # categories come from the train set only, applied to both splits
        cat_dtype = pd.CategoricalDtype(categories=df_train[feature].dropna().unique())
        df_train[feature] = df_train[feature].astype(cat_dtype)
        df_test[feature] = df_test[feature].astype(cat_dtype)
    print('preparing dummies')
    X_train_cat = pd.get_dummies(df_train[categorical_features], sparse=True)
    X_test_cat = pd.get_dummies(df_test[categorical_features], sparse=True)
    print('extracting word vectors')
    tv = TfidfVectorizer(stop_words=stopwords, max_features=10000)
    X_train_desc = tv.fit_transform(df_train.description)
    X_test_desc = tv.transform(df_test.description)
    print('scaling')
    ss = StandardScaler()
    df_train.price = ss.fit_transform(df_train[['price']])
    df_test.price = ss.transform(df_test[['price']])
    print('combining features')
    X_train = csr_matrix(hstack([
        df_train[['price']],
        X_train_cat,
        X_train_desc,
    ]))
    X_test = csr_matrix(hstack([
        df_test[['price']],
        X_test_cat,
        X_test_desc,
    ]))
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Let's fit our model for the first time and see how it will perform:
def train_and_cv(model, X_train, y_train, X_test, y_test):
    cvs = cross_val_score(model, X_train, y_train, cv=CV, scoring=SCORING, n_jobs=-1)
    print('MSE and STD on CV:\t', -cvs.mean(), cvs.std())
    model.fit(X_train, y_train)
    # use the passed-in model, not the global MODEL
    print('MSE on holdout:\t\t', mean_squared_error(y_test, model.predict(X_test)))
    return model
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
And the results are pretty good: we already have a relatively low error. But we can definitely improve it further.
Now let's see the learning curve:
def plot_learning_curve(model, X_train, y_train):
    plt.figure(figsize=(10, 5))
    viz = LearningCurve(model, cv=CV, train_sizes=np.linspace(.1, 1.0, 10), scoring=SCORING, n_jobs=-1)
    viz.fit(X_train, y_train)
    viz.poof()
plot_learning_curve(MODEL, X_train, y_train);
We see a typical picture: the training score decreases with the number of samples given to the model, while the cross-validation score increases.
Still, there is a gap between them, so our model would improve with a larger number of samples.
It is also worth noting that the variance of the cross-validation scores is pretty low, which tells us that our model gives pretty stable predictions.
plt.figure(figsize=(10, 5))
viz = ValidationCurve(MODEL, param_name='alpha', cv=CV, param_range=np.logspace(-1, 1, 10), logx=True, scoring=SCORING, n_jobs=-1)
viz.fit(X_train, y_train)
viz.poof()
As we can see, the best value for our model is located at the maximum of the cross-validation curve.
The variance is still low, as on the learning curve, which is a good sign.
Now let's find the best alpha using a simple grid search:
params = { 'alpha': np.logspace(-1, 1, 10) }
gs = GridSearchCV(MODEL, param_grid=params, verbose=10, n_jobs=-1, cv=CV, scoring=SCORING)
gs.fit(X_train, y_train)
print('Best alpha:', gs.best_params_['alpha'])
MODEL = Ridge(alpha=gs.best_params_['alpha'], random_state=RANDOM_SEED)
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
Let's see how adding the combination of Winery and Designation (a designation is a part of a winery, so they are expected to be processed together) affects our model.
A specific winery and its designation may identify the specific "factory" where the wine is produced, which may affect the wine quality.
df_full['winery_designation'] = df_full.winery + ' / ' + df_full.designation
vc = df_full.winery_designation.value_counts()
df_full.winery_designation = df_full.winery_designation.replace(vc[vc < 2].index, 'Other')
df_full.winery_designation = df_full.winery_designation.fillna('Other')
print('Number of unique winery + designation:', len(df_full.winery_designation.unique()))
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
Our assumption was correct and the performance of the model improved, so let's keep this feature.
It turns out we have a year inside the title:
df_full.title.head()
The wine year is when the grapes were collected, and it is a categorical feature in our case.
The year can tell us about the weather and other conditions related to the specific harvest, and it often affects the quality of the wine.
def extract_year(title):
    # take the first plausible 4-digit number from the title as the vintage, 0 if none
    matches = re.findall(r'\d{4}', title)
    return next(filter(lambda x: 1000 < x <= 2018, map(int, matches)), 0)
df_full['year'] = df_full.title.apply(extract_year)
print('Number of unique years:', len(df_full.year.unique()))
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
And again, the performance of the model is improved, we will keep this feature.
We might hypothesize that winery and year together define the quality of the wine: for example, if the weather was good for some winery and it was in good financial shape in a specific year, we can expect better grape quality from it.
df_full['winery_year'] = df_full.winery + ' / ' + df_full.year.astype(str)
vc = df_full.winery_year.value_counts()
df_full.winery_year = df_full.winery_year.replace(vc[vc < 2].index, 'Other')
df_full.winery_year = df_full.winery_year.fillna('Other')
print('Number of unique winery + year:', len(df_full.winery_year.unique()))
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year', 'winery_year'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
We got a minor improvement in the holdout score, but a worse CV score.
Since this feature also introduces a lot of dummy values, let's not add it to the model.
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test)
Now our model is ready to be used or improved further.
We can train it on the whole dataset (train+test) to get better results in real use, as our learning curve suggested.
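Refitting on all the data is straightforward; here is a sketch on synthetic sparse data (the variable names mirror X_train/X_test above, but the matrices here are random stand-ins):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.linear_model import Ridge

# Random stand-ins for the real X_train / X_test and the point targets
rng = np.random.RandomState(17)
X_tr = csr_matrix(rng.rand(80, 5))
X_te = csr_matrix(rng.rand(20, 5))
y_tr = rng.randint(80, 101, size=80)
y_te = rng.randint(80, 101, size=20)

# Stack train + holdout and refit once before deployment
final_model = Ridge(random_state=17).fit(vstack([X_tr, X_te]), np.concatenate([y_tr, y_te]))
print(final_model.coef_.shape)
```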
We've reached our goal of building a model that can predict the wine rating based on the wine features and its textual description.
Models of this type can be used in the wine industry to predict wine ratings and to augment expert predictions in order to find and resolve their biases.
It would also be possible to deploy this model as a web or mobile application, allowing wine buyers to get predictions for unfamiliar wines in local stores (using the wine description on the bottle in this case).
However, the following things can be tried to improve the model:
Choose the best wines and drink responsibly! :)