Domain knowledge is critical to getting the best out of data analysis and machine learning. In the case of basketball, Dean Oliver identified four factors that are critical to success:
Of course, it is not enough to identify factors, you need a way to measure them.
Read this article about the four factors and how they are measured. In this notebook, we will compute them from the box score data. The numbers are slightly different from that of the article because the article is about the NBA, but these numbers are Dean Oliver's variants for NCAA games.
Shooting is measured as the fraction of field goal attempts made, weighting 3 points higher:
$(FG + 0.5 * 3P) / FGA$
Let's compute the offensive and defensive shooting efficiency and see how correlated they are to winning teams.
See %%bigquery documentation for how to use it.
%%bigquery df1
SELECT
team_code,
AVG(SAFE_DIVIDE(fgm + 0.5 * fgm3,fga)) AS offensive_shooting_efficiency,
AVG(SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga)) AS opponents_shooting_efficiency,
AVG(win) AS win_rate,
COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
Let's remove the entries corresponding to teams that played fewer than 100 games, and then plot it.
df1 = df1[df1['num_games'] > 100]
df1.plot(x='offensive_shooting_efficiency', y='win_rate', style='o');
df1.plot(x='opponents_shooting_efficiency', y='win_rate', style='o');
Does the relationship make sense? Do you think offensive and defensive efficiency are good predictors of a team's performance?
It can be helpful to have a quantitive measure of the strength of the relationship. One way to measure the strength of the relationship is through the correlation. Numbers near 0 mean not correlated and numbers near +/- 1 indicate high correlation.
df1.corr()['win_rate']
Turnover percentage is measured as:
$TOV / (FGA + 0.475 * FTA + TOV - OREB)$
As before, let's compute this, and see whether it is a good predictor. For simplicity, we will compute only offensive turnover percentage, although we should really compute both sides as we did for scoring efficiency.
Again, we'd have to measure both sides, but for simplicity, we'll do only the offensive rebounds.
$ORB / (ORB + Opp DRB)$
This is a measure of both how often a team gets to the line and how often they make them:
$FT / FGA$
%%bigquery df3
SELECT
team_code,
AVG(SAFE_DIVIDE(ftm,fga)) AS freethrows,
AVG(win) AS win_rate,
COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100
Let's use these factors to create a simple ML model. Here's the dataset of features. Add the code to train the ML model.
%%bigquery
SELECT
team_code,
is_home,
SAFE_DIVIDE(fgm + 0.5 * fgm3,fga) AS offensive_shooting_efficiency,
SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga) AS opponents_shooting_efficiency,
SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb) AS turnover_percent,
SAFE_DIVIDE(opp_tov,opp_fga+0.475*opp_fta+opp_tov-opp_oreb) AS opponents_turnover_percent,
SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
SAFE_DIVIDE(ftm,fga) AS freethrows,
SAFE_DIVIDE(opp_ftm,opp_fga) AS opponents_freethrows,
win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
LIMIT 10
Is this correct, though? Will we know the offensive efficiency of the team before the game is played? How do we fix it?
If you are stuck, please feel free to look at the solution notebook.
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.