In [3]:

import pandas

board_games = pandas.read_csv("board_games.csv")
board_games = board_games.dropna(axis=0)
board_games = board_games[board_games["users_rated"] > 0]

board_games.head()

Out[3]:

	id	type	name	yearpublished	minplayers	maxplayers	playingtime	minplaytime	maxplaytime	minage	users_rated	average_rating	bayes_average_rating	total_owners	total_traders	total_wanters	total_wishers	total_comments	total_weights	average_weight
0	12333	boardgame	Twilight Struggle	2005	2	2	180	180	180	13	20113	8.33774	8.22186	26647	372	1219	5865	5347	2562	3.4785
1	120677	boardgame	Terra Mystica	2012	2	5	150	60	150	12	14383	8.28798	8.14232	16519	132	1586	6277	2526	1423	3.8939
2	102794	boardgame	Caverna: The Cave Farmers	2013	1	7	210	30	210	12	9262	8.28994	8.06886	12230	99	1476	5600	1700	777	3.7761
3	25613	boardgame	Through the Ages: A Story of Civilization	2006	2	4	240	240	240	12	13294	8.20407	8.05804	14343	362	1084	5075	3378	1642	4.1590
4	3076	boardgame	Puerto Rico	2002	2	5	150	90	150	12	39883	8.14261	8.04524	44362	795	861	5414	9173	5213	3.2943

In [4]:

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(board_games["average_rating"])

Out[4]:

(array([   602.,   1231.,   2824.,   5206.,   8223.,  13593.,  13849.,
          8470.,   2224.,    672.]),
 array([  1. ,   1.9,   2.8,   3.7,   4.6,   5.5,   6.4,   7.3,   8.2,
          9.1,  10. ]),
 <a list of 10 Patch objects>)

In [5]:

print(board_games["average_rating"].std())
print(board_games["average_rating"].mean())

1.57882993483
6.01611284933

Error metric¶

In this data set, using mean squared error as an error metric makes sense. This is because the data is continuous, and follows a somewhat normal distribution. We'll be able to compare our error to the standard deviation to see how good the model is at predictions.

In [39]:

from sklearn.cluster import KMeans

clus = KMeans(n_clusters=5)
cols = list(board_games.columns)
cols.remove("name")
cols.remove("id")
cols.remove("type")
numeric = board_games[cols]

clus.fit(numeric)

Out[39]:

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)

In [26]:

import numpy
game_mean = numeric.apply(numpy.mean, axis=1)
game_std = numeric.apply(numpy.std, axis=1)

In [27]:

labels = clus.labels_

plt.scatter(x=game_mean, y=game_std, c=labels)

Out[27]:

<matplotlib.collections.PathCollection at 0x10b5516d8>

Game clusters¶

It looks like most of the games are similar, but as the game attributes tend to increase in value (such as number of users who rated), there are fewer high quality games. So most games don't get played much, but a few get a lot of players.

In [29]:

correlations = numeric.corr()

correlations["average_rating"]

Out[29]:

yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64

Correlations¶

The yearpublished column is surprisingly highly correlated with average_rating, showing that more recent games tend to be rated more highly. Games for older players (minage is high) tend to be more highly rated. The more "weighty" a game is (average_weight is high), the more highly it tends to be rated.

In [40]:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
cols.remove("average_rating")
cols.remove("bayes_average_rating")
reg.fit(board_games[cols], board_games["average_rating"])
predictions = reg.predict(board_games[cols])

numpy.mean((predictions - board_games["average_rating"]) ** 2)

Out[40]:

2.0933969758339361

Game clusters¶

The error rate is close to the standard deviation of all board game ratings. This indicates that our model may not have high predictive power. We'll need to dig more into which games were scored well, and which ones weren't.

In [ ]: