#!/usr/bin/env python # coding: utf-8 # # Feature engineering on NCAA data # # Domain knowledge is critical to getting the best out of data analysis and machine learning. # In the case of basketball, Dean Oliver identified four factors that are critical to success: # * Shooting # * Turnovers # * Rebounding # * Free Throws # # Of course, it is not enough to identify factors, you need a way to measure them. # # Read [this article](https://www.basketball-reference.com/about/factors.html) about the four factors and how they are measured. In this notebook, we will compute them from the box score data. The numbers are slightly different from that of the article because the article is about the NBA, but these numbers are Dean Oliver's variants for NCAA games. # ## Shooting efficiency # # Shooting is measured as the fraction of field goal attempts made, weighting 3 points higher: # # $(FG + 0.5 * 3P) / FGA$ # # Let's compute the offensive and defensive shooting efficiency and see how correlated they are to winning teams. # # See [%%bigquery documentation](https://googleapis.github.io/google-cloud-python/latest/bigquery/magics.html) for how to use it. # In[ ]: get_ipython().run_cell_magic('bigquery', 'df1', 'SELECT \n team_code,\n AVG(SAFE_DIVIDE(fgm + 0.5 * fgm3,fga)) AS offensive_shooting_efficiency,\n AVG(SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga)) AS opponents_shooting_efficiency,\n AVG(win) AS win_rate,\n COUNT(win) AS num_games\nFROM lab_dev.team_box\nWHERE fga IS NOT NULL\nGROUP BY team_code\n') # Let's remove the entries corresponding to teams that played fewer than 100 games, and then plot it. # In[ ]: df1 = df1[df1['num_games'] > 100] # In[ ]: df1.plot(x='offensive_shooting_efficiency', y='win_rate', style='o'); # In[ ]: df1.plot(x='opponents_shooting_efficiency', y='win_rate', style='o'); # Does the relationship make sense? Do you think offensive and defensive efficiency are good predictors of a team's performance? # # It can be helpful to have a quantitive measure of the strength of the relationship. One way to measure the strength of the relationship is through the correlation. Numbers near 0 mean not correlated and numbers near +/- 1 indicate high correlation. # In[ ]: df1.corr()['win_rate'] # ## Turnover Percentage # # Turnover percentage is measured as: # # $TOV / (FGA + 0.475 * FTA + TOV - OREB)$ # # As before, let's compute this, and see whether it is a good predictor. For simplicity, we will compute only offensive turnover percentage, although we should really compute both sides as we did for scoring efficiency. # In[ ]: # ## Rebounding # # Again, we'd have to measure both sides, but for simplicity, we'll do only the offensive rebounds. # # $ORB / (ORB + Opp DRB)$ # In[ ]: # ## Free throw factor # # This is a measure of both how often a team gets to the line and how often they make them: # # $FT / FGA$ # # In[ ]: get_ipython().run_cell_magic('bigquery', 'df3', 'SELECT \n team_code,\n AVG(SAFE_DIVIDE(ftm,fga)) AS freethrows,\n AVG(win) AS win_rate,\n COUNT(win) AS num_games\nFROM lab_dev.team_box\nWHERE fga IS NOT NULL\nGROUP BY team_code\nHAVING num_games > 100\n') # ## Machine Learning # # Let's use these factors to create a simple ML model. Here's the dataset of features. Add the code to train the ML model. # In[ ]: get_ipython().run_cell_magic('bigquery', '', 'SELECT \n team_code,\n is_home,\n SAFE_DIVIDE(fgm + 0.5 * fgm3,fga) AS offensive_shooting_efficiency,\n SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga) AS opponents_shooting_efficiency,\n SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb) AS turnover_percent,\n SAFE_DIVIDE(opp_tov,opp_fga+0.475*opp_fta+opp_tov-opp_oreb) AS opponents_turnover_percent,\n SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,\n SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,\n SAFE_DIVIDE(ftm,fga) AS freethrows,\n SAFE_DIVIDE(opp_ftm,opp_fga) AS opponents_freethrows,\n win\nFROM lab_dev.team_box\nWHERE fga IS NOT NULL and win IS NOT NULL\nLIMIT 10\n') # Is this correct, though? Will we know the offensive efficiency of the team before the game is played? How do we fix it? # # If you are stuck, please feel free to look at the solution notebook. # In[ ]: # Copyright 2019 Google Inc. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License.