In this series of tutorials, we cover data scraping, data wrangling, data analytics and advanced plotting, using a sports analytics case.
This particular tutorial covers how to extract data from the web (data scraping) and some basic DataFrame manipulation.
%matplotlib inline
import requests
import pandas as pd
To do data analytics, we need data, so the first step is to get it from somewhere.
From here on we work with NBA player data. The NBA site stats.nba.com publishes statistics for every single player.
The following blog post [1]
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
gives a good explanation of how to get the data from stats.nba.com [2] (or from any other site, for that matter).
Remember, the most important part of scraping data from a web site is knowing how to access the API that serves the data.
To get LeBron James's shot chart data we will use this URL:
#LeBron James data
#PlayerID is the player's ID number on stats.nba.com
shot_chart_url = 'http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPAR'\
'AMS=2014-15&ContextFilter=&ContextMeasure=FGA&DateFrom=&D'\
'ateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Loca'\
'tion=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&'\
'PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=2544&Plu'\
'sMinus=N&Position=&Rank=N&RookieYear=&Season=2014-15&Seas'\
'onSegment=&SeasonType=Regular+Season&TeamID=0&VsConferenc'\
'e=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&sh'\
'owZones=0'
Keep in mind that the URL above corresponds to LeBron James's shot chart data. To access different data, such as another player or a specific team, you will need to find the corresponding URL.
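The URL is an endpoint followed by a query string, so one way to target another player is to rewrite a single parameter instead of editing the string by hand. The sketch below uses only the Python 3 standard library; the helper name `replace_query_params`, the shortened URL and the replacement ID are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replace_query_params(url, **changes):
    """Return url with the given query parameters replaced (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as DateFrom=
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(changes)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Shortened version of the shot chart URL, for illustration only
base_url = ('http://stats.nba.com/stats/shotchartdetail?'
            'DateFrom=&PlayerID=2544&Season=2014-15&SeasonType=Regular+Season')

# '201939' is a placeholder; look up the real ID of the player you want
other_url = replace_query_params(base_url, PlayerID='201939')
print(other_url)
```

The same call works on the full URL, since every parameter that is not named in the call is carried over unchanged.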
We already know that the response is in JSON format. If you have JSONView [3] installed in your web browser, you can view its content by pasting the URL into the browser.
Now let's use the requests module [4] to get the server data from shot_chart_url.
requests is an Apache2-licensed HTTP library; you can find more information about it at http://docs.python-requests.org/en/latest/#
# Get the webpage containing the data
response = requests.get(shot_chart_url)
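A bare `requests.get` call waits indefinitely if the server stalls and does not complain about HTTP error codes. A slightly more defensive variant might look like the sketch below. The helper name `fetch_json` and its `getter` parameter are hypothetical, and the browser-like User-Agent is an assumption: some servers reject the default one, but the original does not say whether stats.nba.com does.

```python
import requests


def fetch_json(url, timeout=10, getter=requests.get):
    """Download url and return the decoded JSON (hypothetical helper)."""
    # Assumption: a browser-like User-Agent; some servers reject the default one
    response = getter(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of failing later
    response.raise_for_status()
    return response.json()


# Usage (needs network access):
# data = fetch_json(shot_chart_url)
```

The `getter` argument exists only so that the function can be exercised without a network connection.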
We can view the server data header as follows:
response.headers
{'content-length': '22207', 'content-encoding': 'gzip', 'expires': 'Sun, 06 Sep 2015 15:06:20 GMT', 'vary': 'Accept-Encoding', 'server': 'Microsoft-IIS/7.5', 'last-modified': 'Sun, 06 Sep 2015 15:06:20 GMT', 'connection': 'keep-alive', 'pragma': 'no-cache', 'cache-control': 'no-cache, no-store, must-revalidate', 'date': 'Sun, 06 Sep 2015 15:06:20 GMT', 'x-powered-by': 'ASP.NET', 'content-type': 'application/json; charset=utf-8', 'x-aspnet-version': '4.0.30319'}
We can access an individual header by name; needless to say, the name must exist in the server's response headers:
response.headers['content-type']
'application/json; charset=utf-8'
As we can see, the data is in JSON format.
By the way, we already knew this from inspecting the site; the header is simply another way to get that information.
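The object in `response.headers` is a `CaseInsensitiveDict` from requests, so header names can be looked up regardless of case, and `.get()` avoids a `KeyError` when a header is missing. The sketch below builds a small stand-in dictionary so that it runs without a network connection; the name `headers_example` is illustrative.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for response.headers
headers_example = CaseInsensitiveDict(
    {'content-type': 'application/json; charset=utf-8'})

# The lookup ignores case
content_type = headers_example['Content-Type']

# .get() returns a default instead of raising KeyError for a missing header
missing = headers_example.get('x-not-there', 'not present')

# Check the media type before trying to decode the body as JSON
is_json = content_type.startswith('application/json')
print(content_type, missing, is_json)
```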
The requests module also has a built-in JSON decoder. To print the content of response, we can proceed as follows:
#uncomment this line to print the content of response
#response.json()
#to time the function
#%time response.json()
Now we are ready to get the data we need to construct the pandas [5] DataFrame.
# Grab the headers to be used as column headers for our DataFrame
headers = response.json()['resultSets'][0]['headers']
The data is in standard JSON format and is made of three main blocks.
We are interested in the block resultSets, hence the notation response.json()['resultSets'].
The resultSets block has two sub-blocks. We want the first one, the one named Shot_Chart_Detail, hence the notation response.json()['resultSets'][0].
Finally, we want the information contained in headers, hence the notation response.json()['resultSets'][0]['headers'].
The object response.json()['resultSets'][0]['headers'] contains the name of each column.
To grab the shot chart data (Shot Chart Detail) from the JSON data, we proceed in a similar way. Keep in mind that the data of interest is located in the block rowSet, hence the notation response.json()['resultSets'][0]['rowSet'].
# Grab the shot chart data
shots = response.json()['resultSets'][0]['rowSet']
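Each call to `response.json()` decodes the body again, so it can be convenient to decode once and to pick the sub-block by its name instead of its position. The sketch below shows the idea on a made-up payload that mimics the layout described above. The `'name'` key, the second block called `'Other'` and the helper `get_result_set` are assumptions for illustration.

```python
# Made-up payload that mimics the described layout; values are illustrative.
# In the notebook this would be: payload = response.json()
payload = {
    'resultSets': [
        {'name': 'Shot_Chart_Detail',
         'headers': ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG'],
         'rowSet': [[114, 148, 0], [-7, 0, 1]]},
        {'name': 'Other', 'headers': [], 'rowSet': []},
    ],
}


def get_result_set(data, name):
    """Return the sub-block of resultSets with the given name (hypothetical helper)."""
    for result_set in data['resultSets']:
        if result_set['name'] == name:
            return result_set
    raise KeyError(name)


detail = get_result_set(payload, 'Shot_Chart_Detail')
example_headers = detail['headers']
example_shots = detail['rowSet']
print(example_headers, len(example_shots))
```

Looking the block up by name keeps the code working if the server ever changes the order of the sub-blocks.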
To check the type of the variables headers and shots, we can proceed as follows:
type(headers)
list
type(shots)
list
Now we can create a pandas DataFrame using the scraped shot chart data.
Remember, the data is stored in the objects shots and headers created in the previous step.
To create the DataFrame, we can proceed as follows:
shot_df = pd.DataFrame(data=shots, columns=headers)
At this point, we have a pandas DataFrame ready to use.
The DataFrame shot_df contains the shot chart data for all the field goal attempts LeBron James took during the 2014-15 regular season.
We are specifically interested in the data saved in the columns LOC_X, LOC_Y and SHOT_MADE_FLAG.
LOC_X and LOC_Y are the coordinates of each shot attempted, and SHOT_MADE_FLAG contains the outcome of the shot (1 if he made it, 0 if he missed it).
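As a rough illustration of how these three columns can be used, the sketch below selects them and filters on the outcome flag. It builds a small stand-in DataFrame from the first four shots of `shot_df` so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for shot_df: the first four shots, reduced to the columns of interest
example_df = pd.DataFrame(
    data=[[114, 148, 0], [-7, 0, 1], [-105, 63, 0], [227, -16, 0]],
    columns=['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG'],
)

# Select a subset of columns by passing a list of labels
coords = example_df[['LOC_X', 'LOC_Y']]

# Boolean mask: keep only the shots that went in
made = example_df[example_df['SHOT_MADE_FLAG'] == 1]

# The mean of a 0/1 flag is the fraction of shots made
made_fraction = example_df['SHOT_MADE_FLAG'].mean()
print(coords.shape, len(made), made_fraction)
```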
To display the data saved in shot_df, you can proceed as follows:
#shot_df.head()
shot_df.head(4)
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | ... | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 0021400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | ... | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 0021400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | ... | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
2 | Shot Chart Detail | 0021400018 | 53 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 4 | 45 | ... | Fadeaway Jump Shot | 2PT Field Goal | Mid-Range | Left Side(L) | 8-16 ft. | 12 | -105 | 63 | 1 | 0 |
3 | Shot Chart Detail | 0021400018 | 77 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 2 | 31 | ... | Jump Shot | 3PT Field Goal | Right Corner 3 | Right Side(R) | 24+ ft. | 22 | 227 | -16 | 1 | 0 |
4 rows × 21 columns
Since we only want to print the first 4 rows, we use .head(4). Called without an argument, .head() returns the first 5 rows; displaying shot_df directly shows the whole DataFrame (pandas abbreviates very long output).
To print a concise summary of the DataFrame:
shot_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1279 entries, 0 to 1278
Data columns (total 21 columns):
GRID_TYPE              1279 non-null object
GAME_ID                1279 non-null object
GAME_EVENT_ID          1279 non-null int64
PLAYER_ID              1279 non-null int64
PLAYER_NAME            1279 non-null object
TEAM_ID                1279 non-null int64
TEAM_NAME              1279 non-null object
PERIOD                 1279 non-null int64
MINUTES_REMAINING      1279 non-null int64
SECONDS_REMAINING      1279 non-null int64
EVENT_TYPE             1279 non-null object
ACTION_TYPE            1279 non-null object
SHOT_TYPE              1279 non-null object
SHOT_ZONE_BASIC        1279 non-null object
SHOT_ZONE_AREA         1279 non-null object
SHOT_ZONE_RANGE        1279 non-null object
SHOT_DISTANCE          1279 non-null int64
LOC_X                  1279 non-null int64
LOC_Y                  1279 non-null int64
SHOT_ATTEMPTED_FLAG    1279 non-null int64
SHOT_MADE_FLAG         1279 non-null int64
dtypes: int64(11), object(10)
memory usage: 219.8+ KB
To print the names of the columns in the DataFrame:
shot_df.columns
Index([ u'GRID_TYPE', u'GAME_ID', u'GAME_EVENT_ID', u'PLAYER_ID', u'PLAYER_NAME', u'TEAM_ID', u'TEAM_NAME', u'PERIOD', u'MINUTES_REMAINING', u'SECONDS_REMAINING', u'EVENT_TYPE', u'ACTION_TYPE', u'SHOT_TYPE', u'SHOT_ZONE_BASIC', u'SHOT_ZONE_AREA', u'SHOT_ZONE_RANGE', u'SHOT_DISTANCE', u'LOC_X', u'LOC_Y', u'SHOT_ATTEMPTED_FLAG', u'SHOT_MADE_FLAG'], dtype='object')
In a similar way, we can print the row labels (the index) of the DataFrame:
shot_df.index
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278], dtype='int64', length=1279)
We can also display all the columns of the first two rows (rows 0 and 1; the end of the slice is excluded), as follows:
shot_df.iloc[0:2,:]
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 0021400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 0021400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
Or we can display the first two columns of the first four rows, as follows:
shot_df.iloc[0:4,0:2]
| | GRID_TYPE | GAME_ID |
---|---|---|
0 | Shot Chart Detail | 0021400018 |
1 | Shot Chart Detail | 0021400018 |
2 | Shot Chart Detail | 0021400018 |
3 | Shot Chart Detail | 0021400018 |
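Position-based slicing with `.iloc` excludes the end of the slice, whereas label-based slicing with `.loc` includes it, which is a common source of off-by-one surprises. The sketch below contrasts the two on a small stand-in DataFrame (the values are the first four shots; the variable names are illustrative).

```python
import pandas as pd

example_df = pd.DataFrame({'GAME_EVENT_ID': [4, 33, 53, 77],
                           'LOC_X': [114, -7, -105, 227]})

# Position based: rows 0 and 1, the end of the slice is excluded
by_position = example_df.iloc[0:2, :]

# Label based: rows 0, 1 and 2, the end of the slice is included
by_label = example_df.loc[0:2, :]

# A single column for all rows, selected by its label
one_column = example_df.loc[:, 'LOC_X']
print(len(by_position), len(by_label), list(one_column))
```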
If you do not define the column labels when creating the pandas DataFrame, they default to np.arange(n), that is, the integers 0 to n-1.
shot_df1 = pd.DataFrame(data=shots)
shot_df1.head(2)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 0021400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | ... | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 0021400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | ... | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
2 rows × 21 columns
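If a DataFrame was created without labels, they can still be attached afterwards by assigning to its `columns` attribute. A minimal sketch on made-up rows (the names `rows`, `names` and `unlabeled` are illustrative):

```python
import pandas as pd

rows = [[114, 148, 0], [-7, 0, 1]]
names = ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG']

# Without the columns argument the labels are integers
unlabeled = pd.DataFrame(data=rows)
default_labels = list(unlabeled.columns)

# Attach the labels afterwards; the list must have one entry per column
unlabeled.columns = names
print(default_labels, list(unlabeled.columns))
```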
If you have many columns, pandas will not show all of them. To force pandas to display all the columns, you can proceed as follows:
#This would limit the display to 6 columns
#pd.set_option('display.max_columns', 6)
#None removes the limit, so pandas displays any number of columns
pd.set_option('display.max_columns', None)
shot_df.head(2)
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 0021400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 0021400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
Alternatively, you can change the option only temporarily and use the API for display objects in IPython.
# View the head of the DataFrame and all its columns
from IPython.display import display
with pd.option_context('display.max_columns', None):
    display(shot_df.head(2))
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 0021400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 0021400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
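The difference between `pd.set_option` and `pd.option_context` is that the latter restores the previous value when the `with` block ends. The short sketch below checks this by reading the option before, inside and after the block; the variable names are illustrative.

```python
import pandas as pd

before = pd.get_option('display.max_columns')

with pd.option_context('display.max_columns', None):
    # Inside the block the limit is removed
    inside = pd.get_option('display.max_columns')

# After the block the previous value is back
after = pd.get_option('display.max_columns')
print(before, inside, after)
```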
If you want to save the pandas DataFrame to a CSV file, you can proceed as follows:
shot_df.to_csv(path_or_buf='test.csv',mode='w')
If you want to time the execution of a Python statement or expression, you can use the %time magic:
%time shot_df.to_csv(path_or_buf='test.csv',mode='w')
CPU times: user 18.4 ms, sys: 3.11 ms, total: 21.5 ms
Wall time: 425 ms
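The `%time` magic only exists inside IPython. In a plain Python script, a comparable wall-clock measurement can be sketched with `time.perf_counter`; the helper `time_call` below is hypothetical and measures a single call, so the number will vary from run to run.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed wall-clock seconds) for one call (hypothetical helper)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a cheap built-in call
result, seconds = time_call(sum, range(1000))
print(result, seconds)
```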
To make your code reusable, you can create functions. Let us create a function that saves the scraped data in CSV format.
def savecsv(name_of_file):
    # Write the global DataFrame shot_df to the given file
    shot_df.to_csv(path_or_buf=name_of_file,mode='w')
We can now call the function:
name_of_file='test1.csv'
savecsv(name_of_file)
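One thing to watch when the CSV file is read back later: values such as GAME_ID (`0021400018`) look like numbers, so `pd.read_csv` would parse them as integers and drop the leading zeros unless told otherwise. The sketch below shows a round trip on a small stand-in DataFrame. The helper `save_csv`, which takes the DataFrame as an argument and omits the row index, and the temporary file path are illustrative choices, not the notebook's own function.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, name_of_file):
    """Write df to name_of_file without the row index (hypothetical helper)."""
    df.to_csv(name_of_file, index=False)


example_df = pd.DataFrame({'GAME_ID': ['0021400018', '0021400018'],
                           'LOC_X': [114, -7]})

# Write to a temporary directory so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
save_csv(example_df, path)

# dtype keeps GAME_ID as text; without it the leading zeros would be lost
restored = pd.read_csv(path, dtype={'GAME_ID': str})
print(restored['GAME_ID'].tolist())
```

Passing the DataFrame explicitly also means the function no longer depends on a global variable.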
If you want to save the server data from the URL shot_chart_url to a JSON file, you need the json module [6].
import json
with open('data_json.json', 'w') as outfile:
    json.dump(response.json(), outfile)
#json.dump(response.json(), open('data1_json.json', 'w'))
#This method does not work: str() writes a Python repr, not valid JSON
#obj = open('data1_json.json', 'wb')
#obj.write(str(response.json()))
#obj.close()
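To check that a saved file can be used later, it can be loaded back with `json.load` and compared with the original data. The sketch below does this on a made-up payload written to a temporary file, and also shows why writing `str(...)` of the decoded data is not a substitute for `json.dump`. The payload and file name are illustrative.

```python
import json
import os
import tempfile

# Made-up payload standing in for response.json()
payload = {'resultSets': [{'name': 'Shot_Chart_Detail',
                           'rowSet': [[114, 148, 0]]}]}

path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w') as outfile:
    json.dump(payload, outfile)

# Load the file back and compare with the original data
with open(path) as infile:
    restored = json.load(infile)

# str() gives Python's repr (single quotes), which the JSON parser rejects
try:
    json.loads(str(payload))
    repr_is_valid_json = True
except ValueError:
    repr_is_valid_json = False

print(restored == payload, repr_is_valid_json)
```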
By the way, do not delete the CSV and JSON files, as we are going to use them later.
Finally, the blog post [7] offers a very nice tutorial on how to create NBA shot charts. From here on, we are going to address and elaborate on most of the things explained in that post.
DataFrame that we just created by scraping data from a web site.
#import sys
#print('Python version:', sys.version_info)
#import IPython
#print('IPython version:', IPython.__version__)
#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)