import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.offline as py
import warnings
import pycountry
from statsmodels.graphics.gofplots import qqplot
from wordcloud import WordCloud, STOPWORDS
warnings.filterwarnings('ignore')
import re
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix, hstack
from yellowbrick.model_selection import ValidationCurve, LearningCurve
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
RANDOM_SEED=17
The data is taken from the Kaggle dataset https://www.kaggle.com/zynicide/wine-reviews/home, which in turn was scraped by the dataset's author from https://www.winemag.com/
It contains many reviews from different experts for wines from all over the world, along with some wine-specific information for each entry.
Dataset consists of the following fields (per info from https://github.com/zackthoutt/wine-deep-learning):
Our target is the wine rating (Points). Reviewers on the original site rate wines from 80 to 100; here are the details of the different ranges:
Range | Mark | Description |
---|---|---|
98–100 | Classic | The pinnacle of quality |
94–97 | Superb | A great achievement |
90–93 | Excellent | Highly recommended |
87–89 | Very Good | Often good value; well recommended |
83–86 | Good | Suitable for everyday consumption; often good value |
80–82 | Acceptable | Can be employed in casual, less-critical circumstances |
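This scale maps directly onto the points; a minimal sketch using `pd.cut` (the bin edges follow the table above, while the sample values and variable names are my own):

```python
import pandas as pd

# Bin edges follow the rating scale above (right edge inclusive);
# the sample points are invented for illustration
bins = [79, 82, 86, 89, 93, 97, 100]
marks = ['Acceptable', 'Good', 'Very Good', 'Excellent', 'Superb', 'Classic']

points = pd.Series([81, 85, 88, 91, 95, 99])
labels = pd.cut(points, bins=bins, labels=marks)
print(labels.tolist())
```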
Originally, the dataset author collected the data to create a predictive model that could identify wines through blind tasting, like a master sommelier would.
Here we will try to solve a simpler task, yet one useful in real life: predict the wine rating based on the wine's features and the words used in its review. This can have the following practical applications:
Unlike other beverages, wine comes in overwhelming variety: there are about 10,000 grape varieties (and the number is growing), they can be blended in different proportions, the harvest year and growing conditions come into play, the wine may be aged for different amounts of time in different types of barrels, and so on.
So a review of a specific wine, or a list like "top 10 wines of the season", isn't very useful: if you visit two different local stores, there is a good chance you won't find the same wine in both of them. Finding a specific wine may require a journey to another city or even country :) Under such conditions it's worth having a model that can predict wine quality without an exact expert rating, based only on the features you can read off the bottle.
While taste is a purely personal matter, professionals always try to free themselves from biases and provide objective observations. Blind testing may help reveal the biases of a specific reviewer.
In fact, the model could be used to cross-validate the expert ratings :)
Let's download the data from Kaggle, extract it into the data folder and check the main properties of the resulting DataFrame:
df = pd.read_csv('data/winemag-data-130k-v2.csv', index_col=0)
df.info(memory_usage='deep')
As we can see, there are many null values in the data; we will need to deal with them later.
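A quick way to rank columns by missing values, sketched on a toy frame (the real counts depend on the dataset version, so the data here is invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real df
toy = pd.DataFrame({
    'country': ['US', None, 'France'],
    'price': [12.0, np.nan, np.nan],
    'points': [88, 90, 85],
})
# NaN count per column, most-affected columns first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```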
df.head()
Let's check the data for possible categorical features:
df.nunique()
Looks like the following features can be represented as categorical:
Let's explore the data now to get better acquainted with the dataset:
plt.figure(figsize=(13, 10))
ax = sns.countplot(y=df.country, order=df.country.value_counts().index, palette='tab10')
for p, label in zip(ax.patches, df.country.value_counts()):
    ax.annotate("{0:,d}".format(label), (p.get_width() + 50, p.get_y() + 0.7))
ax.set_title('Number of wine reviews per country', fontsize=18);
We see that most reviews are for wines from the US, which can be explained by the fact that the reviewers are mostly located there.
It should also be noted that some countries have very few reviews, which may cause problems.
Let's see how the countries are distributed on the map, along with the number of reviews in each.
For the choropleth to display the coloring in a more understandable way, let's log1p-transform the number of reviews per country:
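As a toy illustration of why (the counts below are invented): log1p compresses values spanning several orders of magnitude into a narrow range, while still mapping 0 to 0:

```python
import numpy as np

# log1p(x) = log(1 + x): safe for zero counts, squeezes the range
counts = np.array([0, 9, 99, 999, 9999])
print(np.log1p(counts).round(2))
```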
countries = df.groupby('country').size().reset_index()
countries.columns = ['name', 'size']
countries.name = countries.name.replace({ # making the country names compatible with pycountry
'England': 'United Kingdom',
'Czech Republic': 'Czechia',
'Macedonia': 'Macedonia, Republic of',
'Moldova': 'Moldova, Republic of',
'US': 'United States'
})
data = pd.DataFrame(index=countries.index)
data['name'] = countries.name
data['size'] = countries['size']
data['code'] = countries.apply(lambda x: pycountry.countries.get(name=x['name']), axis=1)
data['code'] = data.code.apply(lambda x: x.alpha_3 if x else None)
data = data.dropna()
choropleth_data = [dict(
type='choropleth',
locations=data['code'],
z=np.log1p(data['size']),
#showscale=False,
text=data['name'],
marker=dict(
line=dict(
color='rgb(180,180,180)',
width=0.5
)),
)]
layout = dict(
title='Number of wine reviews per country, log-transformed',
geo=dict(
showframe=False,
showcoastlines=True,
projection=dict(
type='natural earth'
)
))
fig = dict(data=choropleth_data, layout=layout)
py.iplot(fig, validate=False)
top_rated_countries = df[['country', 'points']].groupby('country').mean().reset_index().sort_values('points', ascending=False).country[:10]
data = df[df.country.isin(top_rated_countries)]
plt.figure(figsize=(15, 7))
ax = sns.violinplot(x='country', y='points', data=data, order=top_rated_countries, palette='tab10')
ax.set_title('Top 10 countries with highest average rating', fontsize=18);
Here we can see that some of the countries with a low number of reviews have a pretty high average rating.
Probably this is because wines with the highest potential rating are the first to be reviewed by the experts.
The dependency between Country and Points is clear.
Countries with few reviews do not have much predictive power and introduce unnecessary noise, so let's replace them with the name 'Other' instead:
vc = df.country.value_counts()
df['trans_country'] = df.country.replace(vc[vc < 100].index, 'Other')
top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]
top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]
plt.figure(figsize=(15, 7))
ax = sns.violinplot(x='trans_country', y='points', data=top_rated_countries_data, order=top_rated_countries, palette='tab10')
ax.set_title('Top 10 countries with highest average rating', fontsize=18);
Now we see a better distribution of the rating among the countries in the top 10 list.
These features are actually parts of the wine location hierarchy, so they are better joined into one field together with the Country. Let's take a look at them:
df[['trans_country', 'province', 'region_1', 'region_2']].head()
df[['trans_country', 'province', 'region_1', 'region_2']].nunique()
print('Countries with Region 2:', df[~df.region_2.isna()].trans_country.unique())
Looks like Region 2 is a US-specific feature, but it won't hurt to include it as well, so we get better categorization for US wines.
df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)
df.location.head()
Now let's try to see if there is a dependency between the Points and Location:
df_top_locations = df[df.location.isin(df.location.value_counts().index[:10])]
plt.figure(figsize=(12, 10))
ax = sns.violinplot(y='location', x='points', data=df_top_locations, palette='tab10');
ax.set_title('Wine rating distribution over top 10 locations with highest average rating', fontsize=18);
Let's see if we can get something from the title:
df[['region_1', 'title']].head()
As we can see, some regions are repeated in the title, so even when the region is NaN, it is possible to fill it with the value from the title. Let's do that:
def extract_region_1(row):
    # keep the existing value when region_1 is already filled
    if not pd.isna(row.region_1):
        return row.region_1
    # otherwise take the region from the trailing "(...)" in the title, if any
    if not row.title.endswith(')'):
        return None
    return row.title[row.title.rindex('(')+1:-1]
df.region_1 = df.apply(extract_region_1, axis=1)
df[['region_1', 'title']].head()
Great, now let's recreate the Location:
df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)
df.location.head()
Now let's replace the locations with a low number of reviews with the name 'Other':
vc = df.location.value_counts()
df.location = df.location.replace(vc[vc < 2].index, 'Other')
Price is given in US$; let's see how it's distributed:
plt.figure(figsize=(15, 5))
data = df[~df.price.isna()]
plt.scatter(range(data.shape[0]), np.sort(data.price.values)[::-1])
plt.title("Distribution of wine prices", fontsize=18)
plt.ylabel('Price');
Wow, there are wines priced at more than $3,000. That's not your usual weekend wine :)
As we can see, the price distribution is very skewed, so let's try to log-transform it:
plt.figure(figsize=(15, 3))
series_price = df[~df.price.isna()].price.apply(np.log1p)
ax = sns.distplot(series_price);
ax.set_title("Distribution of wine prices", fontsize=18)
ax.set_ylabel('Price (log1p)')
ax.set_xlabel('');
Still, it's not normal:
print('Shapiro-Wilk test:', stats.shapiro(series_price))
print('Kolmogorov-Smirnov test:', stats.kstest(series_price, cdf='norm'))
But it's not very skewed anymore:
print('Skewness:', series_price.skew())
print('Kurtosis:', series_price.kurt())
Now let's see a connection between the Price (not log-transformed) and Points:
plt.figure(figsize=(15, 5))
ax = sns.regplot(x='points', y='price', data=df, fit_reg=False, x_jitter=True)
ax.set_title('Correlation between the wine price and points given', fontsize=18);
And now let's see which countries have the most expensive wines (on average):
plt.figure(figsize=(13, 7))
data = df[['country', 'price']].groupby('country').mean().reset_index().sort_values('price', ascending=False)
ax = sns.barplot(y='country', x='price', data=data, palette='tab10')
for p, label in zip(ax.patches, data.price):
    if np.isnan(label):
        continue
    ax.annotate('{0:.2f}'.format(label), (p.get_width() + 0.2, p.get_y() + 0.5))
ax.set_title('Top countries with the most expensive average wine prices');
Interestingly, we see, for example, Germany, Hungary and France among the leaders here; they are also among the leaders for average wine rating above.
Let's take the countries with the top rated wines and see the price distribution in them:
plt.figure(figsize=(15, 5))
sns.violinplot(x='country', y='price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');
Well, that's not good; the Price needs to be transformed:
df['trans_price'] = df.price.apply(np.log1p)
top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]
top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]
plt.figure(figsize=(15, 5))
sns.violinplot(x='trans_country', y='trans_price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');
plt.figure(figsize=(15, 5))
ax = sns.regplot(x='points', y='trans_price', data=df, fit_reg=False, x_jitter=True)
ax.set_title('Correlation between the wine price (log) and points given', fontsize=18);
Let's see the top 10 varieties with their wine counts:
df_top_varieties = df[df.variety.isin(df.variety.value_counts().index[:10])]
plt.figure(figsize=(13, 5))
ax = sns.countplot(y=df_top_varieties.variety, order=df_top_varieties.variety.value_counts().index, palette='tab10')
for p, label in zip(ax.patches, df_top_varieties.variety.value_counts()):
    ax.annotate("{0:,d}".format(label), (p.get_width() + 50, p.get_y() + 0.5))
ax.set_title('Number of wines per variety', fontsize=18);
Now let's see the dependency between the Variety and Points:
plt.figure(figsize=(14, 10))
ax = sns.violinplot(y='variety', x='points', data=df_top_varieties, palette='tab10', order=df_top_varieties.variety.value_counts().index)
ax.set_title('Wine rating distribution over top 10 varieties by wine count', fontsize=18);
As we can see, some varieties get higher points than others, and the shape of the points distribution may also vary.
Variety has the same problem as the other categorical features: some varieties have almost no samples, yet they affect the points heavily:
top_rated_varietes = df[['variety', 'points']].groupby('variety').mean().reset_index().sort_values('points', ascending=False).variety[:10]
top_rated_varietes_data = df[df.variety.isin(top_rated_varietes)]
plt.figure(figsize=(15, 5))
ax = sns.violinplot(x='variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
vc = df.variety.value_counts()
df['trans_variety'] = df.variety.replace(vc[vc < 2].index, 'Other')
top_rated_varietes = df[['trans_variety', 'points']].groupby('trans_variety').mean().reset_index().sort_values('points', ascending=False).trans_variety[:10]
top_rated_varietes_data = df[df.trans_variety.isin(top_rated_varietes)]
plt.figure(figsize=(15, 5))
ax = sns.violinplot(x='trans_variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
df.title.head(10)
The title itself doesn't seem to contain valuable information, except that we have already used it to fill the nulls in Region 1, and we can extract a Year (vintage) from it; we will do that later.
The description is a typical textual variable, which we can try to analyze with word clouds.
Let's see what the experts say about wines with low ratings:
stopwords = set(STOPWORDS)
stopwords.update(['wine', 'a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'f', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'first', 'for', 'four', 'from', 'full', 'fully', 'further', 'furthered', 'furthering', 'furthers', 'g', 'gave', 'general', 'generally', 'get', 'gets', 'give', 'given', 'gives', 'go', 'going', 'good', 'goods', 'got', 'great', 'greater', 'greatest', 'group', 'grouped', 'grouping', 'groups', 'h', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'herself', 'high', 'high', 'high', 'higher', 'highest', 'him', 'himself', 'his', 'how', 'however', 'i', 'if', 'important', 'in', 'interest', 'interested', 'interesting', 'interests', 'into', 'is', 'it', 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kind', 'knew', 'know', 'known', 'knows', 'l', 'large', 'largely', 'last', 'later', 'latest', 'least', 'less', 'let', 'lets', 'like', 'likely', 'long', 'longer', 'longest', 'm', 'made', 'make', 'making', 'man', 'many', 'may', 'me', 'member', 'members', 'men', 'might', 'more', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', 'my', 'myself', 'n', 'necessary', 'need', 'needed', 
'needing', 'needs', 'never', 'new', 'new', 'newer', 'newest', 'next', 'no', 'nobody', 'non', 'noone', 'not', 'nothing', 'now', 'nowhere', 'number', 'numbers', 'o', 'of', 'off', 'often', 'old', 'older', 'oldest', 'on', 'once', 'one', 'only', 'open', 'opened', 'opening', 'opens', 'or', 'order', 'ordered', 'ordering', 'orders', 'other', 'others', 'our', 'out', 'over', 'p', 'part', 'parted', 'parting', 'parts', 'per', 'perhaps', 'place', 'places', 'point', 'pointed', 'pointing', 'points', 'possible', 'present', 'presented', 'presenting', 'presents', 'problem', 'problems', 'put', 'puts', 'q', 'quite', 'r', 'rather', 'really', 'right', 'right', 'room', 'rooms', 's', 'said', 'same', 'saw', 'say', 'says', 'second', 'seconds', 'see', 'seem', 'seemed', 'seeming', 'seems', 'sees', 'several', 'shall', 'she', 'should', 'show', 'showed', 'showing', 'shows', 'side', 'sides', 'since', 'small', 'smaller', 'smallest', 'so', 'some', 'somebody', 'someone', 'something', 'somewhere', 'state', 'states', 'still', 'still', 'such', 'sure', 't', 'take', 'taken', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'therefore', 'these', 'they', 'thing', 'things', 'think', 'thinks', 'this', 'those', 'though', 'thought', 'thoughts', 'three', 'through', 'thus', 'to', 'today', 'together', 'too', 'took', 'toward', 'turn', 'turned', 'turning', 'turns', 'two', 'u', 'under', 'until', 'up', 'upon', 'us', 'use', 'used', 'uses', 'v', 'very', 'w', 'want', 'wanted', 'wanting', 'wants', 'was', 'way', 'ways', 'we', 'well', 'wells', 'went', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 'who', 'whole', 'whose', 'why', 'will', 'with', 'within', 'without', 'work', 'worked', 'working', 'works', 'would', 'x', 'y', 'year', 'years', 'yet', 'you', 'young', 'younger', 'youngest', 'your', 'yours', 'z'])
wordcloud = WordCloud(background_color='white', stopwords=stopwords,
max_words=500, max_font_size=200, width=2000, height=800,
random_state=RANDOM_SEED).generate(' '.join(df[df.points < 83].description.str.lower()))
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud)
plt.title("Low Rated Wines Description Word Cloud", fontsize=20)
plt.axis('off');
bitter, sour, simple, sharp, tart - there must definitely be something wrong with these wines!
wordcloud = WordCloud(background_color='white', stopwords=stopwords,
max_words=500, max_font_size=200, width=2000, height=800,
random_state=RANDOM_SEED).generate(' '.join(df[df.points > 97].description.str.lower()))
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud)
plt.title("High Rated Wines Description Word Cloud", fontsize=20)
plt.axis('off');
Oh yeah, much better: structured, complex, classic, rich, ripe, powerful, intense and other good words which you would expect for pricey, highly rated wines :)
We don't need these fields for our goals, since we will not have them when making predictions with the model.
Let's skip these features for now and see if we can use them later.
Let's see how our target is distributed:
plt.figure(figsize=(15, 5))
sns.distplot(df.points, kde=False);
Well, it looks like we have a binomial-like distribution here, and while it may look normally distributed, the tests don't confirm that:
print('Shapiro-Wilk test:', stats.shapiro(df.points))
print('Kolmogorov-Smirnov test:', stats.kstest(df.points, cdf='norm'))
Skewness is pretty low, however:
print('Skewness:', df.points.skew())
print('Kurtosis:', df.points.kurt())
Here is the QQ-plot, in addition:
plt.rcParams['figure.figsize'] = (7, 7)
qqplot(df.points, line='r');
The problem here is that our Points take discrete values instead of continuous ones, which might suggest treating this problem as classification or ordinal regression.
Still, simple regression should also work well in our case, even though the data is discrete.
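If we ever wanted to respect the discreteness at prediction time, one cheap post-processing option (a sketch, not used below; the helper name is my own) is to round and clip the regression output back onto the 80-100 scale:

```python
import numpy as np

def to_points(raw_predictions):
    # Snap continuous regression output onto the discrete 80..100 scale
    return np.clip(np.rint(raw_predictions), 80, 100).astype(int)

print(to_points(np.array([79.2, 86.6, 101.3])))
```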
There are two popular metrics we can choose from: MAE (mean absolute error) and MSE (mean squared error).
MSE is the better choice for this problem because:
In scikit-learn MSE is represented in negated form under the following name; let's save it:
SCORING = 'neg_mean_squared_error'
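To illustrate the difference on invented numbers: a set of predictions each off by 1 and a set with a single error of 4 have the same MAE, but MSE grows quadratically with the large error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([88, 90, 85, 92])
y_small = np.array([89, 89, 86, 91])    # every prediction off by 1
y_outlier = np.array([88, 90, 85, 96])  # one prediction off by 4

print(mean_absolute_error(y_true, y_small), mean_squared_error(y_true, y_small))
print(mean_absolute_error(y_true, y_outlier), mean_squared_error(y_true, y_outlier))
```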
We have the following meaningful properties of our task:
Given these properties, we can use either SGD or Ridge.
While SGD would be much faster than Ridge on this task, it would not give us the same level of accuracy and requires much more tuning than Ridge, which essentially has a single hyperparameter.
So let's use Ridge and see how it will perform:
MODEL = Ridge(random_state=RANDOM_SEED)
Since our data does not have any heavy specifics, we can choose simple KFold cross-validation with 10 folds and shuffling:
CV = KFold(n_splits=10, shuffle=True, random_state=RANDOM_SEED)
df_full = df.copy(deep=True)
df_full = df_full.drop(['country', 'price', 'taster_name', 'taster_twitter_handle', 'variety', 'province', 'region_1', 'region_2'], axis=1)
df_full.columns = [x.replace('trans_', '') for x in df_full.columns]
df_full.info()
# Fill missing countries with "Other"
df_full.country = df_full.country.fillna('Other')
# Fill missing locations with "Other"
df_full.location = df_full.location.fillna('Other')
# Remove samples with missing prices, since there are not many of them and it's an important feature
df_full = df_full[~df_full.price.isna()]
df_full.info()
df_train, df_test, y_train, y_test = train_test_split(df_full.drop(['points'], axis=1), df_full.points, test_size=0.25, random_state=RANDOM_SEED)
df_train.shape, df_test.shape, y_train.shape, y_test.shape
Note that we will process the train and test sets separately, so as not to introduce a "looking into the future" problem.
categorical_features = ['country', 'variety', 'location']
for feature in categorical_features:
    # Build the dtype from the train set only, then apply it to both splits,
    # so train and test produce identical dummy columns
    cat_dtype = pd.CategoricalDtype(categories=df_train[feature].dropna().unique())
    df_train[feature] = df_train[feature].astype(cat_dtype)
    df_test[feature] = df_test[feature].astype(cat_dtype)
X_train_cat = pd.get_dummies(df_train[categorical_features], sparse=True)
X_test_cat = pd.get_dummies(df_test[categorical_features], sparse=True)
X_train_cat.shape, X_test_cat.shape
tv = TfidfVectorizer(stop_words=stopwords, max_features=10000)
X_train_desc = tv.fit_transform(df_train.description)
X_test_desc = tv.transform(df_test.description)
X_train_desc.shape, X_test_desc.shape
Our model is sensitive to non-centered numeric features, so we need to scale them:
ss = StandardScaler()
df_train.price = ss.fit_transform(df_train[['price']])
df_test.price = ss.transform(df_test[['price']])
X_train = csr_matrix(hstack([
df_train[['price']],
X_train_cat,
X_train_desc,
]))
X_test = csr_matrix(hstack([
df_test[['price']],
X_test_cat,
X_test_desc,
]))
X_train.shape, X_test.shape
def prepare_data(df_full, categorical_features):
    df_train, df_test, y_train, y_test = train_test_split(df_full.drop(['points'], axis=1), df_full.points, test_size=0.25, random_state=RANDOM_SEED)
    print('processing categorical features')
    for feature in categorical_features:
        # categories come from the train set only, applied to both splits
        cat_dtype = pd.CategoricalDtype(categories=df_train[feature].dropna().unique())
        df_train[feature] = df_train[feature].astype(cat_dtype)
        df_test[feature] = df_test[feature].astype(cat_dtype)
    print('preparing dummies')
    X_train_cat = pd.get_dummies(df_train[categorical_features], sparse=True)
    X_test_cat = pd.get_dummies(df_test[categorical_features], sparse=True)
    print('extracting word vectors')
    tv = TfidfVectorizer(stop_words=stopwords, max_features=10000)
    X_train_desc = tv.fit_transform(df_train.description)
    X_test_desc = tv.transform(df_test.description)
    print('scaling')
    ss = StandardScaler()
    df_train.price = ss.fit_transform(df_train[['price']])
    df_test.price = ss.transform(df_test[['price']])
    print('combining features')
    X_train = csr_matrix(hstack([
        df_train[['price']],
        X_train_cat,
        X_train_desc,
    ]))
    X_test = csr_matrix(hstack([
        df_test[['price']],
        X_test_cat,
        X_test_desc,
    ]))
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Let's fit our model for the first time and see how it will perform:
def train_and_cv(model, X_train, y_train, X_test, y_test):
    cvs = cross_val_score(model, X_train, y_train, cv=CV, scoring=SCORING, n_jobs=-1)
    print('MSE and STD on CV:\t', -cvs.mean(), cvs.std())
    model.fit(X_train, y_train)
    # use the passed-in model, not the global MODEL
    print('MSE on holdout:\t\t', mean_squared_error(y_test, model.predict(X_test)))
    return model
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
And the results are pretty good: we already have a relatively low error. But we can definitely improve it further.
Now let's see the learning curve:
def plot_learning_curve(model, X_train, y_train):
    plt.figure(figsize=(10, 5))
    viz = LearningCurve(model, cv=CV, train_sizes=np.linspace(.1, 1.0, 10), scoring=SCORING, n_jobs=-1)
    viz.fit(X_train, y_train)
    viz.poof()
plot_learning_curve(MODEL, X_train, y_train);
We see a typical picture: the training score decreases with the number of samples given to the model, while the cross-validation score increases.
Still, there is a gap between them, so our model would improve with a larger number of samples.
It is also worth noting that the variance of the cross-validation scores is pretty low, which tells us that our model gives pretty stable predictions.
plt.figure(figsize=(10, 5))
viz = ValidationCurve(MODEL, param_name='alpha', cv=CV, param_range=np.logspace(-1, 1, 10), logx=True, scoring=SCORING, n_jobs=-1)
viz.fit(X_train, y_train)
viz.poof()
As we can see, the best value for our model is located at the maximum of the cross-validation curve.
The variance is still low, as on the learning curve, which is a good sign.
Now let's find the best alpha using a simple grid search:
params = { 'alpha': np.logspace(-1, 1, 10) }
gs = GridSearchCV(MODEL, param_grid=params, verbose=10, n_jobs=-1, cv=CV, scoring=SCORING)
gs.fit(X_train, y_train)
print('Best alpha:', gs.best_params_['alpha'])
MODEL = Ridge(alpha=gs.best_params_['alpha'], random_state=RANDOM_SEED)
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
Let's see how adding the combination of Winery and Designation (a designation is a part of a winery, so they are expected to be processed together) affects our model.
A specific winery and its designation may identify the specific "factory" where the wine is produced, which may affect the wine quality.
df_full['winery_designation'] = df_full.winery + ' / ' + df_full.designation
vc = df_full.winery_designation.value_counts()
df_full.winery_designation = df_full.winery_designation.replace(vc[vc < 2].index, 'Other')
df_full.winery_designation = df_full.winery_designation.fillna('Other')
print('Number of unique winery + designation:', len(df_full.winery_designation.unique()))
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
Our assumption was correct and the performance of the model improved, so let's keep this feature.
It turns out we have a year inside the title:
df_full.title.head()
The wine year is when the grapes were collected, and it is a categorical feature in our case.
The year can tell us about the weather and other conditions related to the specific harvest, and it often affects the quality of the wine.
def extract_year(title):
    # take the first plausible 4-digit number from the title as the vintage, 0 if none
    matches = re.findall(r'\d{4}', title)
    return next(filter(lambda x: 1000 < x <= 2018, map(int, matches)), 0)
df_full['year'] = df_full.title.apply(extract_year)
print('Number of unique years:', len(df_full.year.unique()))
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
And again, the performance of the model is improved, we will keep this feature.
We might hypothesize that winery and year together define the quality of the wine: for example, if the weather was good for some winery and it was in good financial shape in a specific year, we can expect better grape quality from it.
df_full['winery_year'] = df_full.winery + ' / ' + df_full.year.astype(str)
vc = df_full.winery_year.value_counts()
df_full.winery_year = df_full.winery_year.replace(vc[vc < 2].index, 'Other')
df_full.winery_year = df_full.winery_year.fillna('Other')
print('Number of unique winery + year:', len(df_full.winery_year.unique()))
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year', 'winery_year'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test);
We got a minor improvement in the holdout score, but a worse CV score.
Since this feature also introduces a lot of dummy values, let's not add it to the model.
X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape
train_and_cv(MODEL, X_train, y_train, X_test, y_test)
Now our model is ready to be used or improved further.
We can train it on the whole dataset (train+test) to get better results in real use, as our learning curve suggested.
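Refitting on all the data is straightforward; here is a sketch on synthetic sparse data (the variable names mirror X_train/X_test above, but the matrices here are random stand-ins):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.linear_model import Ridge

# Random stand-ins for the real X_train / X_test and the point targets
rng = np.random.RandomState(17)
X_tr = csr_matrix(rng.rand(80, 5))
X_te = csr_matrix(rng.rand(20, 5))
y_tr = rng.randint(80, 101, size=80)
y_te = rng.randint(80, 101, size=20)

# Stack train + holdout and refit once before deployment
final_model = Ridge(random_state=17).fit(vstack([X_tr, X_te]), np.concatenate([y_tr, y_te]))
print(final_model.coef_.shape)
```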
We've reached our goal of building a model that can predict the wine rating based on the wine features and its textual description.
Models of this type can be used in the wine industry to predict wine ratings and to augment expert predictions in order to find and resolve their biases.
It would also be possible to deploy this model as a web or mobile application, allowing wine buyers to get predictions for unfamiliar wines in local stores (using the wine description on the bottle in this case).
However, the following things can be tried to improve the model:
Choose the best wines and drink responsibly! :)