In this series of tutorials, we are going to address how to do data scraping, data wrangling, data analytics and advaced plotting. This is a sport analytics case.
In this particular tutorial we address how to extract data or data scraping from the web. We also address some basic DataFrame
manipulation.
%matplotlib inline
import requests
import pandas as pd
In order to do data analytics, we need data. So the first thing to do is get the data from somewhere.
Hereafter we are going to work with NBA players data. In the NBA site stats.nba.com you will find all the statistics for every single player.
In the following blog post [1]
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
you will find a pretty good explanation of how to get the data from stats.nba.com [2] (or any other site for this matter).
Remember, the most important part when scraping data from a web site, is knowing how to access the API used to collect the data.
To get Lebron James shot chart data we will use this url:
#Lebron James data
#PlayerID is the number in the nba.stats site
shot_chart_url = 'http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPAR'\
'AMS=2014-15&ContextFilter=&ContextMeasure=FGA&DateFrom=&D'\
'ateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Loca'\
'tion=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&'\
'PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=2544&Plu'\
'sMinus=N&Position=&Rank=N&RookieYear=&Season=2014-15&Seas'\
'onSegment=&SeasonType=Regular+Season&TeamID=0&VsConferenc'\
'e=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&sh'\
'owZones=0'
Have in mind that the previous url corresponds to Lebron James shot chart data. If you want to access different data, another player or the data for a specifc team, you will need to find the right link.
By now, we already know that the file is in json format. If you have JSONView [3] installed in your web browser, you can view the content of the file. Just copy and paste the link in your web browser
Now let's use the module requests [4] to get the server data from shot_chart_url
.
requests is an Apache2 Licensed HTTP library, you can find more information about requests in the following link http://docs.python-requests.org/en/latest/#
# Get the webpage containing the data
response = requests.get(shot_chart_url)
We can view the server data header as follows:
response.headers
Now we can access the header data using any string we want, no need to say that the string must exist in the server headers:
response.headers['content-type']
As we can see, the data is in json format.
By the way, when we were doing web scraping we already knew that the data was in json format. I just showed you how to get that info from the header.
In the module requests, there is also a builtin JSON decoder. To print the content of responce
, we can proceed as follows
#uncomment this line to print the content of responce
#response.json()
#to time the function
#%time response.json()
Now we are ready to get the data we want in order to construct the pandas [5] DataFrame
.
# Grab the headers to be used as column headers for our DataFrame
headers = response.json()['resultSets'][0]['headers']
The data is in standard json format, and is made of three main blocks.
We are interested in getting the data from the block resultSets
, therefore the notation response.json()['resultSets']
The resultSets
block has two sub-blocks, we want to access the first one or the one with the name Shot_Chart_Detail
, therefore the notation response.json()['resultSets'][0]
Now we want to access the information contained in headers
, hence the notation response.json()['resultSets'][0]['headers']
The object response.json()['resultSets'][0]['headers']
contains the headers' names of each column.
To grab the shot chart data or Shot Chart Detail
in the json data, we proceed in a similar way. Have in mind that the data of interest is located in the block rowSet
. Therefore the notation response.json()['resultSets'][0]['rowSet']
# Grab the shot chart data
shots = response.json()['resultSets'][0]['rowSet']
To know the type of the variables headers
and shots
, we can proceed as follows:
type(headers)
type(shots)
Now we can create a pandas DataFrame
using the scraped shot chart data.
Remember, the data is saved in the objects shots
and headers
created in the previous step.
To create the DataFrame
, we can proceed as follows:
shot_df = pd.DataFrame(data=shots, columns=headers)
At this point, we have a pandas' DataFrame
ready to use.
The DataFrame
shot_df
contains the shot chart data of all the the field goal attempts Lebron James took during the 2014-15 regular season.
We are specifically interested in the data saved in the columns LOC_X
, LOC_Y
and SHOT_MADE_FLAG
.
LOC_X
and LOC_Y
are the coordinate values for each shot attempted and SHOT_MADE_FLAG
contains the outcome of the shot (missed it or made it).
To display the data saved in shot_df
, you can proceed as follows
#shot_df.head()
shot_df.head(4)
As we are only interested in printing the first 4 rows, we use .head(4). If you do not use .head(4) it will display all the rows.
To print a concise summary of the DataFrame
,
shot_df.info()
To print the names of the columns in the DataFrame
,
shot_df.columns
And in a similar way we can print the names of the rows (indexs) in the DataFrame
,
shot_df.index
We can also display all the columns belonging to rows 0 to 2, as follows,
shot_df.iloc[0:2,:]
or we can display the first two columns of the first four rows, as follows,
shot_df.iloc[0:4,0:2]
If you do not define the labels of the columns in the pandas DataFrame
, the column labels will default to np.arange(n)
.
shot_df1 = pd.DataFrame(data=shots)
shot_df1.head(2)
If you have many columns, pandas will not shown all them. To force pandas to display all the columns, you can proceed as follows,
#This will force pandas to display any number of columns.
#pd.set_option('display.max_columns', 6)
pd.set_option('display.max_columns', None)
shot_df.head(2)
Alternatively, you can use the API for display objects in IPython.
# View the head of the DataFrame and all its columns
from IPython.display import display
with pd.option_context('display.max_columns', None): display(shot_df.head(2))
If you want to save the pandas DataFrame
in a csv file, you can proceed as follows
shot_df.to_csv(path_or_buf='test.csv',mode='w')
If you want to time the execution of a Python statement or expression, you can use the %time
magic,
%time shot_df.to_csv(path_or_buf='test.csv',mode='w')
If you want to add reusability to your code, you can create functions. Let us create a function to save the json data in csv format.
def savecsv(name_of_file):
shot_df.to_csv(path_or_buf=name_of_file,mode='w')
We can now access the function,
name_of_file='test1.csv'
savecsv(name_of_file)
If you want to save the server data from the url shot_chart_url
in a json file, we need to use the module json [6].
import json
with open('data_json.json', 'w') as outfile:
json.dump(response.json(), outfile)
#json.dump(response.json(), open('data1_json.json', 'w'))
#This method is not working
#obj = open('data1_json.json', 'wb')
#obj.write(str(response.json()))
#obj.close
By the way, do not erase the csv and json files, as we are going to use them later.
Finally, in the blog post [7], you will find a very nice tutorial of how to create NBA shot charts. Hereafter, we are going to address and elaborate on most of the things explained in the mentioned blog post.
DataFrame
that we just created by scraping data from a web site.¶
#import sys
#print('Python version:', sys.version_info)
#import IPython
#print('IPython version:', IPython.__version__)
#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)