Domain knowledge is critical to getting the best out of data analysis and machine learning. In the case of basketball, Dean Oliver identified four factors that are critical to success: shooting, turnovers, rebounding, and free throws.
Of course, it is not enough to identify factors; you also need a way to measure them.
Read this article about the four factors and how they are measured. In this notebook, we will compute them from the box score data. The numbers differ slightly from those in the article because the article covers the NBA; here we use Dean Oliver's variants for NCAA games.
Shooting is measured as the fraction of field goal attempts made, weighting 3 points higher:
$(FG + 0.5 \times 3P) / FGA$
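As a sanity check, the formula can be sketched in plain Python (the box score numbers here are made up):

```python
def shooting_efficiency(fgm, fgm3, fga):
    """Effective field goal percentage: made threes get an extra 0.5 weight."""
    if fga == 0:
        return None  # mirrors SAFE_DIVIDE's behavior on zero attempts
    return (fgm + 0.5 * fgm3) / fga

# 25 field goals made, 8 of them threes, on 55 attempts
print(f"{shooting_efficiency(25, 8, 55):.3f}")  # → 0.527
```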
Let's compute the offensive and defensive shooting efficiency and see how correlated they are to winning teams.
See the `%%bigquery` cell magic documentation for how to use it.
%%bigquery df1
SELECT
team_code,
AVG(SAFE_DIVIDE(fgm + 0.5 * fgm3,fga)) AS offensive_shooting_efficiency,
AVG(SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga)) AS opponents_shooting_efficiency,
AVG(win) AS win_rate,
COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
  | team_code | offensive_shooting_efficiency | opponents_shooting_efficiency | win_rate | num_games
---|---|---|---|---|---
0 | 272 | 0.506978 | 0.516485 | 0.426626 | 661 |
1 | 632 | 0.473055 | 0.504116 | 0.429664 | 654 |
2 | 178 | 0.469923 | 0.512281 | 0.320840 | 667 |
3 | 504980 | 0.461662 | 0.567646 | 0.000000 | 25 |
4 | 183 | 0.544678 | 0.519526 | 0.476190 | 504 |
5 | 28 | 0.516613 | 0.503602 | 0.548433 | 702 |
6 | 483 | 0.504365 | 0.513276 | 0.443769 | 658 |
7 | 7 | 0.461051 | 0.500835 | 0.413445 | 595 |
8 | 664 | 0.500198 | 0.509669 | 0.520000 | 675 |
9 | 204 | 0.483567 | 0.473647 | 0.530864 | 729 |
10 | 229 | 0.479087 | 0.487914 | 0.382309 | 667 |
11 | 158 | 0.513192 | 0.503893 | 0.467949 | 624 |
12 | 136 | 0.444162 | 0.542146 | 0.202683 | 671 |
13 | 141 | 0.506201 | 0.545918 | 0.337719 | 684 |
14 | 402 | 0.481683 | 0.508385 | 0.385321 | 654 |
15 | 676 | 0.527071 | 0.493973 | 0.748201 | 695 |
16 | 172 | 0.496961 | 0.518819 | 0.343154 | 577 |
17 | 2678 | 0.467513 | 0.508301 | 0.345312 | 640 |
18 | 231 | 0.495587 | 0.501348 | 0.417630 | 692 |
19 | 699 | 0.499126 | 0.502554 | 0.517974 | 612 |
20 | 471 | 0.506805 | 0.493353 | 0.513514 | 629 |
21 | 554 | 0.532023 | 0.486887 | 0.646865 | 606 |
22 | 288 | 0.512249 | 0.475227 | 0.633001 | 703 |
23 | 101 | 0.489015 | 0.517949 | 0.340989 | 566 |
24 | 1334 | 0.430077 | 0.557402 | 0.069767 | 43 |
25 | 572 | 0.493080 | 0.482184 | 0.572183 | 568 |
26 | 742 | 0.524123 | 0.491863 | 0.552743 | 711 |
27 | 782 | 0.519189 | 0.464462 | 0.779202 | 702 |
28 | 721 | 0.513223 | 0.523255 | 0.417981 | 634 |
29 | 697 | 0.501382 | 0.474652 | 0.580210 | 667 |
... | ... | ... | ... | ... | ... |
1134 | 30036 | 0.406250 | 0.690715 | 0.000000 | 3 |
1135 | 199 | 0.417882 | 0.612814 | 0.000000 | 6 |
1136 | 30232 | 0.344412 | 0.632807 | 0.000000 | 6 |
1137 | 8911 | 0.430807 | 0.608784 | 0.000000 | 6 |
1138 | 119 | 0.443858 | 0.578475 | 0.000000 | 3 |
1139 | 506514 | 0.313492 | 0.606239 | 0.000000 | 9 |
1140 | 1079 | 0.424555 | 0.525222 | 0.000000 | 3 |
1141 | 20 | 0.400667 | 0.480387 | 0.000000 | 3 |
1142 | 24400 | 0.435043 | 0.761804 | 0.000000 | 3 |
1143 | 30209 | 0.581699 | 0.611641 | 0.000000 | 3 |
1144 | 30222 | 0.420286 | 0.626385 | 0.000000 | 3 |
1145 | 506332 | 0.399307 | 0.433187 | 0.000000 | 3 |
1146 | 18422 | 0.376947 | 0.698700 | 0.000000 | 3 |
1147 | 503863 | 0.372083 | 0.679280 | 0.000000 | 3 |
1148 | 506495 | 0.365774 | 0.630245 | 0.000000 | 3 |
1149 | 30165 | 0.256732 | 0.514999 | 0.000000 | 3 |
1150 | 506534 | 0.403608 | 0.472195 | 0.000000 | 3 |
1151 | 803 | 0.301586 | 0.640208 | 0.000000 | 3 |
1152 | 16142 | 0.403827 | 0.681255 | 0.000000 | 3 |
1153 | 9684 | 0.353597 | 0.535714 | 0.000000 | 3 |
1154 | 506190 | 0.469890 | 0.771429 | 0.000000 | 3 |
1155 | 506063 | 0.453915 | 0.507163 | 0.000000 | 3 |
1156 | 681 | 0.395522 | 0.669123 | 0.000000 | 3 |
1157 | 506469 | 0.409970 | 0.614945 | 0.000000 | 3 |
1158 | 506232 | 0.318842 | 0.615385 | 0.000000 | 3 |
1159 | 504906 | 0.418800 | 0.576923 | 0.000000 | 3 |
1160 | 507 | 0.558824 | 0.384615 | 1.000000 | 1 |
1161 | 2814 | 0.491995 | 0.428956 | 1.000000 | 2 |
1162 | 500615 | 0.500000 | 0.378571 | 1.000000 | 1 |
1163 | 9013 | 0.453704 | 0.306818 | 1.000000 | 1 |
1164 rows × 5 columns
Let's remove the entries corresponding to teams that played fewer than 100 games, and then plot the relationships.
df1 = df1[df1['num_games'] > 100]
df1.plot(x='offensive_shooting_efficiency', y='win_rate', style='o');
df1.plot(x='opponents_shooting_efficiency', y='win_rate', style='o');
Does the relationship make sense? Do you think offensive and defensive efficiency are good predictors of a team's performance?
Turnover percentage is measured as:
$TOV / (FGA + 0.475 \times FTA + TOV - OREB)$
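The same expression in Python, with made-up numbers (the 0.475 free-throw weight is the NCAA variant used throughout this notebook):

```python
def turnover_percent(tov, fga, fta, oreb):
    """Turnovers per estimated possession."""
    possessions = fga + 0.475 * fta + tov - oreb
    return tov / possessions

# 12 turnovers on 55 FGA and 20 FTA, with 10 offensive rebounds
print(f"{turnover_percent(12, 55, 20, 10):.3f}")  # → 0.180
```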
As before, let's compute this, and see whether it is a good predictor. For simplicity, we will compute only offensive turnover percentage, although we should really compute both sides as we did for scoring efficiency.
%%bigquery df2
SELECT
team_code,
AVG(SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb)) AS turnover_percent,
AVG(win) AS win_rate,
COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100
  | team_code | turnover_percent | win_rate | num_games
---|---|---|---|---
0 | 272 | 0.191072 | 0.426626 | 661 |
1 | 632 | 0.212263 | 0.429664 | 654 |
2 | 178 | 0.197997 | 0.320840 | 667 |
3 | 183 | 0.196554 | 0.476190 | 504 |
4 | 28 | 0.174868 | 0.548433 | 702 |
5 | 483 | 0.202556 | 0.443769 | 658 |
6 | 7 | 0.202094 | 0.413445 | 595 |
7 | 664 | 0.183439 | 0.520000 | 675 |
8 | 204 | 0.185664 | 0.530864 | 729 |
9 | 229 | 0.186421 | 0.382309 | 667 |
10 | 158 | 0.182368 | 0.467949 | 624 |
11 | 136 | 0.205889 | 0.202683 | 671 |
12 | 141 | 0.185401 | 0.337719 | 684 |
13 | 402 | 0.183679 | 0.385321 | 654 |
14 | 676 | 0.204146 | 0.748201 | 695 |
15 | 172 | 0.199702 | 0.343154 | 577 |
16 | 2678 | 0.229683 | 0.345312 | 640 |
17 | 231 | 0.194933 | 0.417630 | 692 |
18 | 699 | 0.192619 | 0.517974 | 612 |
19 | 471 | 0.190303 | 0.513514 | 629 |
20 | 554 | 0.174488 | 0.646865 | 606 |
21 | 288 | 0.175029 | 0.633001 | 703 |
22 | 101 | 0.188541 | 0.340989 | 566 |
23 | 572 | 0.182393 | 0.572183 | 568 |
24 | 742 | 0.176804 | 0.552743 | 711 |
25 | 782 | 0.166937 | 0.779202 | 702 |
26 | 721 | 0.190736 | 0.417981 | 634 |
27 | 697 | 0.192820 | 0.580210 | 667 |
28 | 725 | 0.190195 | 0.465774 | 672 |
29 | 386 | 0.196228 | 0.285491 | 641 |
... | ... | ... | ... | ... |
321 | 716 | 0.183046 | 0.424818 | 685 |
322 | 328 | 0.181668 | 0.813538 | 783 |
323 | 488 | 0.209874 | 0.356932 | 678 |
324 | 738 | 0.184885 | 0.704802 | 708 |
325 | 711 | 0.190054 | 0.492997 | 714 |
326 | 514 | 0.173974 | 0.564560 | 728 |
327 | 392 | 0.190724 | 0.680224 | 713 |
328 | 305 | 0.187948 | 0.502146 | 699 |
329 | 260 | 0.166279 | 0.859211 | 760 |
330 | 553 | 0.202890 | 0.385399 | 589 |
331 | 371 | 0.195883 | 0.562412 | 713 |
332 | 28600 | 0.193170 | 0.556492 | 593 |
333 | 472 | 0.194620 | 0.731973 | 735 |
334 | 457 | 0.169430 | 0.750968 | 775 |
335 | 355 | 0.195786 | 0.501408 | 710 |
336 | 416 | 0.187034 | 0.738622 | 769 |
337 | 17 | 0.205061 | 0.355200 | 625 |
338 | 556 | 0.175228 | 0.593496 | 738 |
339 | 80 | 0.194060 | 0.397380 | 458 |
340 | 14927 | 0.186778 | 0.711409 | 596 |
341 | 741 | 0.180576 | 0.345576 | 599 |
342 | 650 | 0.168030 | 0.501916 | 522 |
343 | 334 | 0.171997 | 0.810710 | 803 |
344 | 688 | 0.177524 | 0.681392 | 747 |
345 | 1104 | 0.177548 | 0.647287 | 516 |
346 | 464 | 0.170675 | 0.481720 | 465 |
347 | 505 | 0.181377 | 0.538062 | 578 |
348 | 2 | 0.197957 | 0.462247 | 543 |
349 | 368 | 0.200837 | 0.409836 | 488 |
350 | 2743 | 0.193202 | 0.483301 | 509 |
351 rows × 4 columns
df2.plot(x='turnover_percent', y='win_rate', style='o');
Rebounding is measured as the fraction of available offensive rebounds a team secures. Again, we'd have to measure both sides, but for simplicity, we'll do only the offensive rebounds:
$ORB / (ORB + Opp\,DRB)$
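With hypothetical rebound counts, the computation is just:

```python
oreb, opp_dreb = 10, 24  # our offensive boards vs. their defensive boards
rebounding = oreb / (oreb + opp_dreb)
print(f"{rebounding:.3f}")  # → 0.294
```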
%%bigquery df3
SELECT
team_code,
AVG(SAFE_DIVIDE(oreb,oreb + opp_dreb)) AS rebounding,
AVG(win) AS win_rate,
COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100
  | team_code | rebounding | win_rate | num_games
---|---|---|---|---
0 | 272 | 0.241530 | 0.426626 | 661 |
1 | 632 | 0.293107 | 0.429664 | 654 |
2 | 178 | 0.297945 | 0.320840 | 667 |
3 | 183 | 0.245612 | 0.476190 | 504 |
4 | 28 | 0.283190 | 0.548433 | 702 |
5 | 483 | 0.276217 | 0.443769 | 658 |
6 | 7 | 0.336807 | 0.413445 | 595 |
7 | 664 | 0.287098 | 0.520000 | 675 |
8 | 204 | 0.322037 | 0.530864 | 729 |
9 | 229 | 0.291606 | 0.382309 | 667 |
10 | 158 | 0.289716 | 0.467949 | 624 |
11 | 136 | 0.282258 | 0.202683 | 671 |
12 | 141 | 0.283392 | 0.337719 | 684 |
13 | 402 | 0.314090 | 0.385321 | 654 |
14 | 676 | 0.352438 | 0.748201 | 695 |
15 | 172 | 0.287586 | 0.343154 | 577 |
16 | 2678 | 0.282616 | 0.345312 | 640 |
17 | 231 | 0.295516 | 0.417630 | 692 |
18 | 699 | 0.315440 | 0.517974 | 612 |
19 | 471 | 0.287515 | 0.513514 | 629 |
20 | 554 | 0.255758 | 0.646865 | 606 |
21 | 288 | 0.331052 | 0.633001 | 703 |
22 | 101 | 0.297901 | 0.340989 | 566 |
23 | 572 | 0.316834 | 0.572183 | 568 |
24 | 742 | 0.278543 | 0.552743 | 711 |
25 | 782 | 0.337313 | 0.779202 | 702 |
26 | 721 | 0.253597 | 0.417981 | 634 |
27 | 697 | 0.333753 | 0.580210 | 667 |
28 | 725 | 0.284064 | 0.465774 | 672 |
29 | 386 | 0.245261 | 0.285491 | 641 |
... | ... | ... | ... | ... |
321 | 716 | 0.303904 | 0.424818 | 685 |
322 | 328 | 0.332755 | 0.813538 | 783 |
323 | 488 | 0.303440 | 0.356932 | 678 |
324 | 738 | 0.299944 | 0.704802 | 708 |
325 | 711 | 0.364667 | 0.492997 | 714 |
326 | 514 | 0.294900 | 0.564560 | 728 |
327 | 392 | 0.316788 | 0.680224 | 713 |
328 | 305 | 0.263030 | 0.502146 | 699 |
329 | 260 | 0.320352 | 0.859211 | 760 |
330 | 553 | 0.303100 | 0.385399 | 589 |
331 | 371 | 0.252450 | 0.562412 | 713 |
332 | 28600 | 0.289313 | 0.556492 | 593 |
333 | 472 | 0.362197 | 0.731973 | 735 |
334 | 457 | 0.384203 | 0.750968 | 775 |
335 | 355 | 0.270435 | 0.501408 | 710 |
336 | 416 | 0.341266 | 0.738622 | 769 |
337 | 17 | 0.302730 | 0.355200 | 625 |
338 | 556 | 0.325347 | 0.593496 | 738 |
339 | 80 | 0.276110 | 0.397380 | 458 |
340 | 14927 | 0.281573 | 0.711409 | 596 |
341 | 741 | 0.276466 | 0.345576 | 599 |
342 | 650 | 0.267929 | 0.501916 | 522 |
343 | 334 | 0.372246 | 0.810710 | 803 |
344 | 688 | 0.338079 | 0.681392 | 747 |
345 | 1104 | 0.295409 | 0.647287 | 516 |
346 | 464 | 0.272608 | 0.481720 | 465 |
347 | 505 | 0.292021 | 0.538062 | 578 |
348 | 2 | 0.270286 | 0.462247 | 543 |
349 | 368 | 0.255851 | 0.409836 | 488 |
350 | 2743 | 0.259266 | 0.483301 | 509 |
351 rows × 4 columns
df3.plot(x='rebounding', y='win_rate', style='o');
The relationship doesn't seem all that strong here. One way to quantify the strength of a relationship is the correlation coefficient: values near 0 mean the variables are essentially uncorrelated, while values near ±1 indicate a strong linear relationship:
df3.corr()['win_rate']
team_code     0.054005
rebounding    0.377354
win_rate      1.000000
num_games     0.464207
Name: win_rate, dtype: float64
The correlation between rebounding and win_rate is 0.38. Compare that to the first data frame:
df1.corr()['win_rate']
team_code                        0.054005
offensive_shooting_efficiency    0.671579
opponents_shooting_efficiency   -0.742178
win_rate                         1.000000
num_games                        0.464207
Name: win_rate, dtype: float64
Notice that the offensive and opponents' shooting efficiency have correlations of 0.67 and -0.74, which are considerably stronger.
df2.corr()['win_rate']
team_code           0.054005
turnover_percent   -0.551083
win_rate            1.000000
num_games           0.464207
Name: win_rate, dtype: float64
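For intuition about the scale of these numbers, here is a toy example (entirely made-up data) showing what `DataFrame.corr` reports for perfectly correlated and perfectly anti-correlated columns:

```python
import pandas as pd

toy = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],   # moves exactly with x
    'z': [9, 7, 5, 3],   # moves exactly against x
})
print(toy.corr()['x'])   # y → 1.0, z → -1.0
```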
Free throw rate is a measure of both how often a team gets to the line and how often they make their free throws:
$FT / FGA$
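In Python, with the same style of made-up box score line as before:

```python
ftm, fga = 15, 55  # free throws made, field goal attempts
freethrows = ftm / fga
print(f"{freethrows:.3f}")  # → 0.273
```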
%%bigquery df3
SELECT
team_code,
AVG(SAFE_DIVIDE(ftm,fga)) AS freethrows,
AVG(win) AS win_rate,
COUNT(win) AS num_games
FROM lab_dev.team_box
WHERE fga IS NOT NULL
GROUP BY team_code
HAVING num_games > 100
  | team_code | freethrows | win_rate | num_games
---|---|---|---|---
0 | 272 | 0.243652 | 0.426626 | 661 |
1 | 632 | 0.237715 | 0.429664 | 654 |
2 | 178 | 0.222857 | 0.320840 | 667 |
3 | 183 | 0.254770 | 0.476190 | 504 |
4 | 28 | 0.288545 | 0.548433 | 702 |
5 | 483 | 0.274948 | 0.443769 | 658 |
6 | 7 | 0.248667 | 0.413445 | 595 |
7 | 664 | 0.261282 | 0.520000 | 675 |
8 | 204 | 0.272575 | 0.530864 | 729 |
9 | 229 | 0.231934 | 0.382309 | 667 |
10 | 158 | 0.248947 | 0.467949 | 624 |
11 | 136 | 0.221863 | 0.202683 | 671 |
12 | 141 | 0.257229 | 0.337719 | 684 |
13 | 402 | 0.253590 | 0.385321 | 654 |
14 | 676 | 0.289181 | 0.748201 | 695 |
15 | 172 | 0.243333 | 0.343154 | 577 |
16 | 2678 | 0.252833 | 0.345312 | 640 |
17 | 231 | 0.266086 | 0.417630 | 692 |
18 | 699 | 0.310747 | 0.517974 | 612 |
19 | 471 | 0.264406 | 0.513514 | 629 |
20 | 554 | 0.267905 | 0.646865 | 606 |
21 | 288 | 0.249748 | 0.633001 | 703 |
22 | 101 | 0.303917 | 0.340989 | 566 |
23 | 572 | 0.272403 | 0.572183 | 568 |
24 | 742 | 0.283348 | 0.552743 | 711 |
25 | 782 | 0.280968 | 0.779202 | 702 |
26 | 721 | 0.249609 | 0.417981 | 634 |
27 | 697 | 0.252731 | 0.580210 | 667 |
28 | 725 | 0.220399 | 0.465774 | 672 |
29 | 386 | 0.283397 | 0.285491 | 641 |
... | ... | ... | ... | ... |
321 | 716 | 0.264069 | 0.424818 | 685 |
322 | 328 | 0.294445 | 0.813538 | 783 |
323 | 488 | 0.288288 | 0.356932 | 678 |
324 | 738 | 0.298986 | 0.704802 | 708 |
325 | 711 | 0.315352 | 0.492997 | 714 |
326 | 514 | 0.301645 | 0.564560 | 728 |
327 | 392 | 0.298517 | 0.680224 | 713 |
328 | 305 | 0.294915 | 0.502146 | 699 |
329 | 260 | 0.285509 | 0.859211 | 760 |
330 | 553 | 0.246203 | 0.385399 | 589 |
331 | 371 | 0.257085 | 0.562412 | 713 |
332 | 28600 | 0.302776 | 0.556492 | 593 |
333 | 472 | 0.299636 | 0.731973 | 735 |
334 | 457 | 0.248660 | 0.750968 | 775 |
335 | 355 | 0.269752 | 0.501408 | 710 |
336 | 416 | 0.259399 | 0.738622 | 769 |
337 | 17 | 0.260396 | 0.355200 | 625 |
338 | 556 | 0.303351 | 0.593496 | 738 |
339 | 80 | 0.277944 | 0.397380 | 458 |
340 | 14927 | 0.249898 | 0.711409 | 596 |
341 | 741 | 0.226334 | 0.345576 | 599 |
342 | 650 | 0.270419 | 0.501916 | 522 |
343 | 334 | 0.318962 | 0.810710 | 803 |
344 | 688 | 0.276128 | 0.681392 | 747 |
345 | 1104 | 0.306530 | 0.647287 | 516 |
346 | 464 | 0.273508 | 0.481720 | 465 |
347 | 505 | 0.246932 | 0.538062 | 578 |
348 | 2 | 0.258238 | 0.462247 | 543 |
349 | 368 | 0.263038 | 0.409836 | 488 |
350 | 2743 | 0.325863 | 0.483301 | 509 |
351 rows × 4 columns
df3.plot(x='freethrows', y='win_rate', style='o');
df3.corr()['win_rate']
team_code     0.054005
freethrows    0.353668
win_rate      1.000000
num_games     0.464207
Name: win_rate, dtype: float64
Let's use these factors to create a simple ML model.
%%bigquery
SELECT
team_code,
is_home,
SAFE_DIVIDE(fgm + 0.5 * fgm3,fga) AS offensive_shooting_efficiency,
SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga) AS opponents_shooting_efficiency,
SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb) AS turnover_percent,
SAFE_DIVIDE(opp_tov,opp_fga+0.475*opp_fta+opp_tov-opp_oreb) AS opponents_turnover_percent,
SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
SAFE_DIVIDE(ftm,fga) AS freethrows,
SAFE_DIVIDE(opp_ftm,opp_fga) AS opponents_freethrows,
win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
LIMIT 10
  | team_code | is_home | offensive_shooting_efficiency | opponents_shooting_efficiency | turnover_percent | opponents_turnover_percent | rebounding | opponents_rebounding | freethrows | opponents_freethrows | win
---|---|---|---|---|---|---|---|---|---|---|---
0 | 272 | 0 | 0.396226 | 0.645161 | 0.099875 | 0.291845 | 0.189189 | 0.235294 | 0.188679 | 0.774194 | 0 |
1 | 632 | 0 | 0.360000 | 0.532258 | 0.219873 | 0.358905 | 0.305556 | 0.434783 | 0.200000 | 0.774194 | 0 |
2 | 178 | 0 | 0.520000 | 0.651515 | 0.191388 | 0.290089 | 0.185185 | 0.350000 | 0.200000 | 0.878788 | 0 |
3 | 504980 | 0 | 0.463636 | 0.727273 | 0.198777 | 0.247966 | 0.378378 | 0.176471 | 0.254545 | 0.787879 | 0 |
4 | 183 | 0 | 0.405172 | 0.632353 | 0.140242 | 0.275638 | 0.214286 | 0.227273 | 0.172414 | 0.764706 | 0 |
5 | 28 | 0 | 0.526786 | 0.785714 | 0.098928 | 0.236088 | 0.266667 | 0.166667 | 0.196429 | 0.485714 | 0 |
6 | 183 | 0 | 0.608108 | 0.614286 | 0.383212 | 0.332564 | 0.350000 | 0.315789 | 0.189189 | 0.314286 | 0 |
7 | 183 | 0 | 0.397959 | 0.600000 | 0.207502 | 0.302775 | 0.176471 | 0.210526 | 0.204082 | 0.371429 | 0 |
8 | 183 | 0 | 0.559524 | 0.542857 | 0.317726 | 0.282158 | 0.208333 | 0.315789 | 0.142857 | 0.742857 | 0 |
9 | 183 | 0 | 0.597561 | 0.541667 | 0.201465 | 0.163339 | 0.200000 | 0.304348 | 0.243902 | 0.694444 | 0 |
%%bigquery
CREATE OR REPLACE MODEL lab_dev.four_factors_model
OPTIONS(model_type='logistic_reg', input_label_cols=['win'])
AS
SELECT
team_code,
is_home,
SAFE_DIVIDE(fgm + 0.5 * fgm3,fga) AS offensive_shooting_efficiency,
SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga) AS opponents_shooting_efficiency,
SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb) AS turnover_percent,
SAFE_DIVIDE(opp_tov,opp_fga+0.475*opp_fta+opp_tov-opp_oreb) AS opponents_turnover_percent,
SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
SAFE_DIVIDE(ftm,fga) AS freethrows,
SAFE_DIVIDE(opp_ftm,opp_fga) AS opponents_freethrows,
win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL lab_dev.four_factors_model)
  | precision | recall | accuracy | f1_score | log_loss | roc_auc
---|---|---|---|---|---|---
0 | 0.869986 | 0.872272 | 0.871723 | 0.871127 | 0.326231 | 0.93667 |
87% accuracy isn't bad, but there is a huge problem with the above approach: how are we supposed to know Team A's free throw shooting percentage against Team B before the game is played?
What we could do is get the free throw shooting percentage of Team A in the four games prior to this one and use that. This requires analytic (window) functions in SQL. If you are not familiar with these, make a copy of the SELECT statement and modify it in stages until you grasp what is happening.
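If the SQL window clause is hard to parse, the same "average of the prior games" idea can be sketched in pandas (toy data for a single team, with games already in date order); `shift(1)` excludes the current game, matching `ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING`:

```python
import pandas as pd

games = pd.DataFrame({
    'team_code':  [1, 1, 1, 1, 1, 1],
    'freethrows': [0.20, 0.30, 0.25, 0.35, 0.40, 0.10],
})

# Average of up to the four prior games, never including the current one.
games['freethrows_prior'] = (
    games.groupby('team_code')['freethrows']
         .transform(lambda s: s.shift(1).rolling(4, min_periods=1).mean())
)
print(games['freethrows_prior'].tolist())
# [nan, 0.2, 0.25, 0.25, 0.275, 0.325]
```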
%%bigquery
CREATE OR REPLACE MODEL lab_dev.four_factors_model
OPTIONS(model_type='logistic_reg', input_label_cols=['win'])
AS
WITH all_games AS (
SELECT
game_date,
team_code,
is_home,
SAFE_DIVIDE(fgm + 0.5 * fgm3,fga) AS offensive_shooting_efficiency,
SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga) AS opponents_shooting_efficiency,
SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb) AS turnover_percent,
SAFE_DIVIDE(opp_tov,opp_fga+0.475*opp_fta+opp_tov-opp_oreb) AS opponents_turnover_percent,
SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
SAFE_DIVIDE(ftm,fga) AS freethrows,
SAFE_DIVIDE(opp_ftm,opp_fga) AS opponents_freethrows,
win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
)
, prevgames AS (
SELECT
is_home,
AVG(offensive_shooting_efficiency)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS offensive_shooting_efficiency,
AVG(opponents_shooting_efficiency)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_shooting_efficiency,
AVG(turnover_percent)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS turnover_percent,
AVG(opponents_turnover_percent)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_turnover_percent,
AVG(rebounding)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS rebounding,
AVG(opponents_rebounding)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_rebounding,
AVG(freethrows)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS freethrows,
AVG(opponents_freethrows)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_freethrows,
win
FROM all_games
)
SELECT * FROM prevgames
WHERE offensive_shooting_efficiency IS NOT NULL
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL lab_dev.four_factors_model)
  | precision | recall | accuracy | f1_score | log_loss | roc_auc
---|---|---|---|---|---|---
0 | 0.694751 | 0.697642 | 0.693924 | 0.696194 | 0.578669 | 0.764541 |
Based on just the teams' performance coming into a game, we can predict the outcome with 69.4% accuracy.
We can write a more complex ML model using Keras and a deep neural network. The code is not that hard, but you'll have to do a lot more work (feature scaling, hyperparameter tuning) to get better performance than you did with the BigQuery ML model.
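As one example of that extra work, here is a minimal sketch of feature scaling: standardize every feature using statistics from the training rows only, so no information from the test set leaks into training (the arrays below are made up):

```python
import numpy as np

def standardize(train, test):
    """Scale both splits with the training split's mean and std."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (train - mean) / std, (test - mean) / std

train = np.array([[0.50, 0.20], [0.40, 0.30], [0.60, 0.10]])
test = np.array([[0.55, 0.25]])
train_s, test_s = standardize(train, test)
print(train_s.mean(axis=0))  # ≈ [0. 0.] after scaling
```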
%%bigquery games
WITH all_games AS (
SELECT
game_date,
team_code,
is_home,
SAFE_DIVIDE(fgm + 0.5 * fgm3,fga) AS offensive_shooting_efficiency,
SAFE_DIVIDE(opp_fgm + 0.5 * opp_fgm3,opp_fga) AS opponents_shooting_efficiency,
SAFE_DIVIDE(tov,fga+0.475*fta+tov-oreb) AS turnover_percent,
SAFE_DIVIDE(opp_tov,opp_fga+0.475*opp_fta+opp_tov-opp_oreb) AS opponents_turnover_percent,
SAFE_DIVIDE(oreb,oreb + opp_dreb) AS rebounding,
SAFE_DIVIDE(opp_oreb,opp_oreb + dreb) AS opponents_rebounding,
SAFE_DIVIDE(ftm,fga) AS freethrows,
SAFE_DIVIDE(opp_ftm,opp_fga) AS opponents_freethrows,
win
FROM lab_dev.team_box
WHERE fga IS NOT NULL and win IS NOT NULL
)
, prevgames AS (
SELECT
is_home,
AVG(offensive_shooting_efficiency)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS offensive_shooting_efficiency,
AVG(opponents_shooting_efficiency)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_shooting_efficiency,
AVG(turnover_percent)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS turnover_percent,
AVG(opponents_turnover_percent)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_turnover_percent,
AVG(rebounding)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS rebounding,
AVG(opponents_rebounding)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_rebounding,
AVG(freethrows)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS freethrows,
AVG(opponents_freethrows)
OVER(PARTITION BY team_code ORDER BY game_date ASC ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS opponents_freethrows,
win
FROM all_games
)
SELECT * FROM prevgames
WHERE offensive_shooting_efficiency IS NOT NULL
  | is_home | offensive_shooting_efficiency | opponents_shooting_efficiency | turnover_percent | opponents_turnover_percent | rebounding | opponents_rebounding | freethrows | opponents_freethrows | win
---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 0.404762 | 0.577586 | 0.129534 | 0.132319 | 0.255814 | 0.312500 | 0.380952 | 0.465517 | 0 |
1 | 0 | 0.485714 | 0.564655 | 0.117888 | 0.135268 | 0.221657 | 0.267361 | 0.390476 | 0.370690 | 0 |
2 | 0 | 0.409668 | 0.577586 | 0.129161 | 0.132480 | 0.246537 | 0.321098 | 0.381530 | 0.465517 | 0 |
3 | 0 | 0.407251 | 0.549261 | 0.172692 | 0.135533 | 0.268236 | 0.369395 | 0.379897 | 0.451817 | 0 |
4 | 0 | 0.396338 | 0.529865 | 0.198496 | 0.102454 | 0.245949 | 0.396533 | 0.409659 | 0.403619 | 0 |
5 | 0 | 0.362626 | 0.495194 | 0.265873 | 0.139328 | 0.332407 | 0.497227 | 0.377841 | 0.486827 | 0 |
6 | 0 | 0.371149 | 0.473499 | 0.311279 | 0.147634 | 0.291667 | 0.508506 | 0.286932 | 0.381368 | 0 |
7 | 0 | 0.396149 | 0.493297 | 0.275781 | 0.202785 | 0.263889 | 0.484101 | 0.193182 | 0.322167 | 0 |
8 | 0 | 0.407723 | 0.500373 | 0.280093 | 0.269113 | 0.267677 | 0.491741 | 0.068182 | 0.305872 | 0 |
9 | 0 | 0.376692 | 0.547112 | 0.243627 | 0.218389 | 0.171843 | 0.371205 | 0.057692 | 0.237032 | 0 |
10 | 0 | 0.370739 | 0.615526 | 0.241415 | 0.192551 | 0.221843 | 0.320966 | 0.093407 | 0.243215 | 0 |
11 | 0 | 0.317498 | 0.653836 | 0.270233 | 0.118722 | 0.227263 | 0.266799 | 0.139703 | 0.273507 | 0 |
12 | 0 | 0.321416 | 0.657631 | 0.253746 | 0.068291 | 0.273914 | 0.220563 | 0.264703 | 0.373794 | 0 |
13 | 0 | 0.361993 | 0.630026 | 0.247155 | 0.071182 | 0.312803 | 0.243672 | 0.292011 | 0.415461 | 0 |
14 | 0 | 0.425237 | 0.542446 | 0.221989 | 0.088116 | 0.288293 | 0.228122 | 0.297963 | 0.450944 | 1 |
15 | 0 | 0.466282 | 0.462653 | 0.236614 | 0.124585 | 0.361246 | 0.229592 | 0.404106 | 0.412888 | 1 |
16 | 0 | 0.476893 | 0.421784 | 0.258220 | 0.178508 | 0.381641 | 0.246259 | 0.330830 | 0.310714 | 1 |
17 | 0 | 0.495422 | 0.409388 | 0.269521 | 0.232157 | 0.384199 | 0.237436 | 0.304653 | 0.285714 | 1 |
18 | 0 | 0.492198 | 0.401888 | 0.278082 | 0.275053 | 0.421530 | 0.251877 | 0.318542 | 0.260714 | 0 |
19 | 0 | 0.509110 | 0.465395 | 0.234967 | 0.251346 | 0.311039 | 0.318054 | 0.213273 | 0.317105 | 0 |
20 | 0 | 0.522538 | 0.516946 | 0.199711 | 0.205544 | 0.215205 | 0.328470 | 0.200010 | 0.310209 | 0 |
21 | 1 | 0.516138 | 0.576589 | 0.185148 | 0.164143 | 0.165670 | 0.417359 | 0.196742 | 0.353066 | 0 |
22 | 1 | 0.498529 | 0.627498 | 0.154878 | 0.132432 | 0.169516 | 0.425286 | 0.234937 | 0.493975 | 0 |
23 | 1 | 0.499494 | 0.573770 | 0.158281 | 0.150213 | 0.163936 | 0.374306 | 0.233222 | 0.516453 | 0 |
24 | 0 | 0.487742 | 0.554936 | 0.164454 | 0.171188 | 0.217782 | 0.350556 | 0.268834 | 0.636393 | 0 |
25 | 0 | 0.445234 | 0.572793 | 0.154771 | 0.186478 | 0.256552 | 0.261667 | 0.228430 | 0.597703 | 0 |
26 | 0 | 0.426768 | 0.550336 | 0.140307 | 0.210642 | 0.203427 | 0.300303 | 0.202862 | 0.453222 | 0 |
27 | 1 | 0.384722 | 0.607281 | 0.126310 | 0.220407 | 0.244780 | 0.316970 | 0.193771 | 0.427528 | 1 |
28 | 1 | 0.406818 | 0.586706 | 0.126736 | 0.238385 | 0.299267 | 0.299601 | 0.195455 | 0.344841 | 1 |
29 | 1 | 0.466436 | 0.484733 | 0.130789 | 0.257743 | 0.339194 | 0.309125 | 0.223781 | 0.302078 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
242570 | 0 | 0.494877 | 0.636939 | 0.250341 | 0.106176 | 0.212605 | 0.447520 | 0.063715 | 0.154970 | 0 |
242571 | 0 | 0.401742 | 0.596582 | 0.222527 | 0.114765 | 0.187473 | 0.438682 | 0.118210 | 0.144701 | 0 |
242572 | 0 | 0.406288 | 0.565718 | 0.183735 | 0.121739 | 0.156888 | 0.462093 | 0.135634 | 0.160133 | 0 |
242573 | 0 | 0.410285 | 0.573655 | 0.187710 | 0.151244 | 0.180989 | 0.465588 | 0.180486 | 0.199094 | 0 |
242574 | 0 | 0.360669 | 0.601036 | 0.267922 | 0.165863 | 0.247436 | 0.426325 | 0.222152 | 0.197406 | 0 |
242575 | 0 | 0.462424 | 0.642155 | 0.274116 | 0.190360 | 0.304081 | 0.441699 | 0.228002 | 0.257028 | 0 |
242576 | 0 | 0.459701 | 0.713249 | 0.281827 | 0.231361 | 0.283530 | 0.433245 | 0.217349 | 0.248300 | 0 |
242577 | 0 | 0.423924 | 0.731165 | 0.263945 | 0.227899 | 0.217062 | 0.421066 | 0.172498 | 0.199260 | 0 |
242578 | 0 | 0.285714 | 0.600000 | 0.284775 | 0.131291 | 0.083333 | 0.458333 | 0.321429 | 0.177778 | 0 |
242579 | 0 | 0.318913 | 0.583333 | 0.241916 | 0.128292 | 0.089744 | 0.437500 | 0.259306 | 0.216667 | 0 |
242580 | 0 | 0.306250 | 0.542683 | 0.277932 | 0.173348 | 0.271186 | 0.395833 | 0.025000 | 0.231707 | 0 |
242581 | 0 | 0.300561 | 0.551829 | 0.281994 | 0.142076 | 0.308007 | 0.371830 | 0.038141 | 0.250000 | 0 |
242582 | 0 | 0.306065 | 0.542683 | 0.278273 | 0.172442 | 0.272005 | 0.394553 | 0.025427 | 0.231707 | 0 |
242583 | 0 | 0.324143 | 0.556318 | 0.272739 | 0.181963 | 0.252080 | 0.386824 | 0.039341 | 0.250169 | 0 |
242584 | 0 | 0.326806 | 0.565868 | 0.263222 | 0.190830 | 0.244469 | 0.379329 | 0.075344 | 0.258419 | 0 |
242585 | 0 | 0.315588 | 0.566249 | 0.246991 | 0.214809 | 0.229691 | 0.384478 | 0.128700 | 0.246033 | 0 |
242586 | 0 | 0.316448 | 0.610151 | 0.229092 | 0.191450 | 0.237726 | 0.338994 | 0.144726 | 0.254396 | 0 |
242587 | 0 | 0.292800 | 0.623346 | 0.227558 | 0.182843 | 0.252149 | 0.283799 | 0.124456 | 0.244673 | 0 |
242588 | 0 | 0.353846 | 0.657534 | 0.142626 | 0.163419 | 0.125000 | 0.375000 | 0.107692 | 0.109589 | 0 |
242589 | 0 | 0.392440 | 0.639578 | 0.099742 | 0.135365 | 0.110119 | 0.369318 | 0.174536 | 0.081822 | 0 |
242590 | 0 | 0.358849 | 0.657867 | 0.138008 | 0.161207 | 0.122795 | 0.375842 | 0.116357 | 0.110103 | 0 |
242591 | 0 | 0.355675 | 0.653556 | 0.159511 | 0.163903 | 0.103460 | 0.353310 | 0.173806 | 0.176327 | 0 |
242592 | 0 | 0.381499 | 0.633404 | 0.188832 | 0.161908 | 0.151756 | 0.426227 | 0.146883 | 0.206622 | 0 |
242593 | 0 | 0.404762 | 0.471831 | 0.311025 | 0.247678 | 0.227273 | 0.369565 | 0.190476 | 0.295775 | 0 |
242594 | 0 | 0.422078 | 0.465645 | 0.282698 | 0.275601 | 0.196970 | 0.424783 | 0.216450 | 0.296536 | 0 |
242595 | 0 | 0.372340 | 0.475806 | 0.302001 | 0.200215 | 0.176471 | 0.459459 | 0.170213 | 0.258065 | 0 |
242596 | 0 | 0.356625 | 0.429080 | 0.323144 | 0.186438 | 0.147059 | 0.457002 | 0.085106 | 0.246679 | 0 |
242597 | 0 | 0.371083 | 0.482482 | 0.302957 | 0.200104 | 0.176471 | 0.460224 | 0.163404 | 0.259691 | 0 |
242598 | 0 | 0.367255 | 0.518111 | 0.286675 | 0.187142 | 0.165248 | 0.446730 | 0.136976 | 0.253102 | 0 |
242599 | 0 | 0.362133 | 0.554332 | 0.263531 | 0.172676 | 0.146130 | 0.439008 | 0.103682 | 0.283413 | 0 |
242600 rows × 10 columns
import tensorflow as tf
import tensorflow.keras as keras
nrows = len(games)
ncols = len(games.iloc[0])
ntrain = (nrows * 7) // 10
print(nrows, ncols, ntrain)
242600 10 169820
# 0:ntrain are the training data; remaining rows are testing
# last col is the label
train_x = games.iloc[:ntrain, 0:(ncols-1)]
train_y = games.iloc[:ntrain, ncols-1]
test_x = games.iloc[ntrain:, 0:(ncols-1)]
test_y = games.iloc[ntrain:, ncols-1]
model = keras.models.Sequential()
model.add(keras.layers.Dense(5, input_dim=ncols-1, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_x, train_y, epochs=5, batch_size=32)
score = model.evaluate(test_x, test_y, batch_size=512)
print(score)
Epoch 1/5
169820/169820 [==============================] - 11s 65us/sample - loss: 0.6281 - acc: 0.6539
Epoch 2/5
169820/169820 [==============================] - 9s 53us/sample - loss: 0.5976 - acc: 0.6771
Epoch 3/5
169820/169820 [==============================] - 10s 56us/sample - loss: 0.5907 - acc: 0.6841
Epoch 4/5
169820/169820 [==============================] - 10s 58us/sample - loss: 0.5886 - acc: 0.6861
Epoch 5/5
169820/169820 [==============================] - 9s 52us/sample - loss: 0.5877 - acc: 0.6860
72780/72780 [==============================] - 0s 4us/sample - loss: 0.5494 - acc: 0.7151
[0.5493521325746779, 0.71512777]
With a deep neural network, we are able to get 71.5% accuracy on the held-out data using the four factors model.
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.