Data scraping, data wrangling, data analytics, exploratory data analysis, advanced plotting and clustering with love.

In this series of tutorials, we are going to address how to do data scraping, data wrangling, data analytics and advanced plotting. This is a sports analytics case.

In this particular tutorial we address how to extract data from the web (data scraping). We also address some basic DataFrame manipulation.

In [ ]:
 
In [2]:
%matplotlib inline
In [ ]:
 
In [3]:
import requests
import pandas as pd
In [ ]:
 

In order to do data analytics, we need data. So the first thing to do is get the data from somewhere.

Hereafter we are going to work with NBA player data. On the NBA site stats.nba.com you will find all the statistics for every single player.

In the following blog post [1]

http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

you will find a pretty good explanation of how to get the data from stats.nba.com [2] (or any other site, for that matter).

Remember, the most important part of scraping data from a web site is knowing how to access the API that the site uses to deliver the data.

In [ ]:
 

To get LeBron James's shot chart data we will use this URL:

In [4]:
#LeBron James data
#PlayerID is the player's ID number on the stats.nba.com site

shot_chart_url = 'http://stats.nba.com/stats/shotchartdetail?CFID=33&CFPAR'\
                'AMS=2014-15&ContextFilter=&ContextMeasure=FGA&DateFrom=&D'\
                'ateTo=&GameID=&GameSegment=&LastNGames=0&LeagueID=00&Loca'\
                'tion=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&'\
                'PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=2544&Plu'\
                'sMinus=N&Position=&Rank=N&RookieYear=&Season=2014-15&Seas'\
                'onSegment=&SeasonType=Regular+Season&TeamID=0&VsConferenc'\
                'e=&VsDivision=&mode=Advanced&showDetails=0&showShots=1&sh'\
                'owZones=0'  
                

Keep in mind that the previous URL corresponds to LeBron James's shot chart data. If you want to access different data, such as another player or a specific team, you will need to find the right link.
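One way to avoid hand-editing the long query string is to build it from a dictionary. The sketch below is only an illustration: `build_shot_chart_url` is a hypothetical helper (not part of any library), it uses only the parameter names visible in the URL above, and it assumes the endpoint accepts them unchanged for other players.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URL above.
BASE_URL = 'http://stats.nba.com/stats/shotchartdetail'


def build_shot_chart_url(player_id, season='2014-15'):
    """Hypothetical helper: rebuild the shot chart URL for a given player."""
    params = {
        'CFID': 33, 'CFPARAMS': season, 'ContextFilter': '',
        'ContextMeasure': 'FGA', 'DateFrom': '', 'DateTo': '',
        'GameID': '', 'GameSegment': '', 'LastNGames': 0,
        'LeagueID': '00', 'Location': '', 'MeasureType': 'Base',
        'Month': 0, 'OpponentTeamID': 0, 'Outcome': '',
        'PaceAdjust': 'N', 'PerMode': 'PerGame', 'Period': 0,
        'PlayerID': player_id, 'PlusMinus': 'N', 'Position': '',
        'Rank': 'N', 'RookieYear': '', 'Season': season,
        'SeasonSegment': '', 'SeasonType': 'Regular Season',
        'TeamID': 0, 'VsConference': '', 'VsDivision': '',
        'mode': 'Advanced', 'showDetails': 0, 'showShots': 1,
        'showZones': 0,
    }
    # urlencode escapes the values; the space in 'Regular Season' becomes '+'.
    return BASE_URL + '?' + urlencode(params)


url = build_shot_chart_url(player_id=2544)  # 2544 is the PlayerID used above
print(url)
```

Changing `player_id` or `season` then only requires a different argument, although the ID of the player still has to be looked up on the site.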

In [ ]:
 
In [ ]:
 

Now let's use the module requests [4] to get the server data from shot_chart_url.

requests is an Apache2 Licensed HTTP library. You can find more information about requests at the following link: http://docs.python-requests.org/en/latest/#

In [5]:
# Get the webpage containing the data
response = requests.get(shot_chart_url)
In [ ]:
 

We can view the headers of the server response as follows:

In [6]:
response.headers
Out[6]:
{'content-length': '22207', 'content-encoding': 'gzip', 'expires': 'Sun, 06 Sep 2015 15:06:20 GMT', 'vary': 'Accept-Encoding', 'server': 'Microsoft-IIS/7.5', 'last-modified': 'Sun, 06 Sep 2015 15:06:20 GMT', 'connection': 'keep-alive', 'pragma': 'no-cache', 'cache-control': 'no-cache, no-store, must-revalidate', 'date': 'Sun, 06 Sep 2015 15:06:20 GMT', 'x-powered-by': 'ASP.NET', 'content-type': 'application/json; charset=utf-8', 'x-aspnet-version': '4.0.30319'}
In [ ]:
 

Now we can access an individual header by its name. Needless to say, the name must exist in the server headers:

In [7]:
response.headers['content-type']
Out[7]:
'application/json; charset=utf-8'

As we can see, the data is in JSON format.

By the way, when we were doing the web scraping we already knew that the data was in JSON format. I just showed you how to get that information from the header.
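If you want to check the format in code rather than by eye, a small test on the header value is enough. This is a minimal sketch; `is_json_content_type` is a hypothetical helper name, and the header strings are passed in directly so the example runs without contacting the server.

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value announces JSON."""
    # Drop parameters such as '; charset=utf-8' and normalise the case.
    media_type = content_type.split(';')[0].strip().lower()
    return media_type == 'application/json'


# Header value copied from the output above.
print(is_json_content_type('application/json; charset=utf-8'))  # True
print(is_json_content_type('text/html; charset=utf-8'))         # False
```

In the notebook you would pass `response.headers['content-type']` to the function.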

In [ ]:
 

The module requests also has a built-in JSON decoder. To print the content of response, we can proceed as follows:

In [8]:
#uncomment this line to print the content of response
#response.json()

#to time the function
#%time response.json()
In [ ]:
 

Now we are ready to get the data we want in order to construct the pandas [5] DataFrame.

In [9]:
# Grab the headers to be used as column headers for our DataFrame
headers = response.json()['resultSets'][0]['headers']

The data is in standard JSON format and is made of three main blocks.

We are interested in getting the data from the block resultSets, hence the notation response.json()['resultSets']

The resultSets block has two sub-blocks. We want to access the first one, which is named Shot_Chart_Detail, hence the notation response.json()['resultSets'][0]

Now we want to access the information contained in headers, hence the notation response.json()['resultSets'][0]['headers']

The object response.json()['resultSets'][0]['headers'] contains the name of each column.
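The nesting described above can be tried out on a small dictionary. The payload below is a hypothetical, heavily trimmed stand-in for response.json(): only the resultSets block is reproduced, and the key 'name' and the second sub-block 'OtherBlock' are assumptions made for illustration.

```python
# Hypothetical, trimmed payload that mimics the structure described above.
# The key 'name' and the second sub-block are assumptions for illustration.
payload = {
    'resultSets': [
        {
            'name': 'Shot_Chart_Detail',
            'headers': ['PLAYER_NAME', 'LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG'],
            'rowSet': [['LeBron James', 114, 148, 0],
                       ['LeBron James', -7, 0, 1]],
        },
        {'name': 'OtherBlock', 'headers': [], 'rowSet': []},
    ],
}

# Select the sub-block by name rather than by position.
detail = next(block for block in payload['resultSets']
              if block['name'] == 'Shot_Chart_Detail')

headers = detail['headers']
print(headers)
```

Selecting by name is a little more verbose than `[0]`, but it would keep working if the order of the sub-blocks changed.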

In [ ]:
 

To grab the shot chart data (Shot Chart Detail) from the JSON data, we proceed in a similar way. Keep in mind that the data of interest is located in the block rowSet, hence the notation response.json()['resultSets'][0]['rowSet']

In [10]:
# Grab the shot chart data
shots = response.json()['resultSets'][0]['rowSet']
In [ ]:
 

To know the type of the variables headers and shots, we can proceed as follows:

In [11]:
type(headers)
Out[11]:
list
In [12]:
type(shots)
Out[12]:
list
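Since both objects are plain Python lists, it is easy to run a quick sanity check on them. The sketch below uses small stand-ins for headers and shots (a few columns of the first two shots) so that it runs on its own.

```python
# Small stand-ins for the `headers` and `shots` lists obtained above.
headers = ['PLAYER_NAME', 'LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG']
shots = [['LeBron James', 114, 148, 0],
         ['LeBron James', -7, 0, 1]]

# Every row should hold exactly one value per header.
rows_ok = all(len(row) == len(headers) for row in shots)
print(len(shots), 'rows,', len(headers), 'columns, consistent:', rows_ok)

# Pair each header with its value to read a single shot by column name.
first_shot = dict(zip(headers, shots[0]))
print(first_shot['LOC_X'], first_shot['LOC_Y'])
```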
In [ ]:
 

Now we can create a pandas DataFrame using the scraped shot chart data.

Remember, the data is saved in the objects shots and headers created in the previous step.

To create the DataFrame, we can proceed as follows:

In [13]:
shot_df = pd.DataFrame(data=shots, columns=headers)

At this point, we have a pandas DataFrame ready to use.

In [ ]:
 

The DataFrame shot_df contains the shot chart data of all the field goal attempts LeBron James took during the 2014-15 regular season.

We are specifically interested in the data saved in the columns LOC_X, LOC_Y and SHOT_MADE_FLAG.

LOC_X and LOC_Y are the coordinate values for each shot attempted, and SHOT_MADE_FLAG contains the outcome of the shot (0 for a missed shot, 1 for a made shot).
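As a small illustration of how these three columns can be used, the sketch below selects them and computes the fraction of shots made. It works on a four-row sample copied from the first rows of shot_df, not on the full season, so the number it prints says nothing about the season as a whole.

```python
import pandas as pd

# Four shots copied from the first four rows of shot_df.
sample = pd.DataFrame({
    'LOC_X': [114, -7, -105, 227],
    'LOC_Y': [148, 0, 63, -16],
    'SHOT_MADE_FLAG': [0, 1, 0, 0],
})

# Keep only the three columns of interest.
coords = sample[['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG']]

# The flag is 1 for a made shot and 0 for a miss,
# so its mean is the fraction of shots made.
made_fraction = coords['SHOT_MADE_FLAG'].mean()
print(made_fraction)  # 0.25 for this four-row sample
```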

To display the data saved in shot_df, you can proceed as follows:

In [14]:
#shot_df.head()
shot_df.head(4)
Out[14]:
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING ... ACTION_TYPE SHOT_TYPE SHOT_ZONE_BASIC SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG
0 Shot Chart Detail 0021400018 4 2544 LeBron James 1610612739 Cleveland Cavaliers 1 11 20 ... Jump Shot 2PT Field Goal Mid-Range Right Side Center(RC) 16-24 ft. 18 114 148 1 0
1 Shot Chart Detail 0021400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 ... Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1
2 Shot Chart Detail 0021400018 53 2544 LeBron James 1610612739 Cleveland Cavaliers 1 4 45 ... Fadeaway Jump Shot 2PT Field Goal Mid-Range Left Side(L) 8-16 ft. 12 -105 63 1 0
3 Shot Chart Detail 0021400018 77 2544 LeBron James 1610612739 Cleveland Cavaliers 1 2 31 ... Jump Shot 3PT Field Goal Right Corner 3 Right Side(R) 24+ ft. 22 227 -16 1 0

4 rows × 21 columns

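As we are only interested in printing the first 4 rows, we use .head(4). Called without an argument, .head() returns the first 5 rows. If you display shot_df without .head(), pandas tries to show the whole DataFrame and truncates long output according to its display settings.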

In [ ]:
 

To print a concise summary of the DataFrame:

In [57]:
shot_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1279 entries, 0 to 1278
Data columns (total 21 columns):
GRID_TYPE              1279 non-null object
GAME_ID                1279 non-null object
GAME_EVENT_ID          1279 non-null int64
PLAYER_ID              1279 non-null int64
PLAYER_NAME            1279 non-null object
TEAM_ID                1279 non-null int64
TEAM_NAME              1279 non-null object
PERIOD                 1279 non-null int64
MINUTES_REMAINING      1279 non-null int64
SECONDS_REMAINING      1279 non-null int64
EVENT_TYPE             1279 non-null object
ACTION_TYPE            1279 non-null object
SHOT_TYPE              1279 non-null object
SHOT_ZONE_BASIC        1279 non-null object
SHOT_ZONE_AREA         1279 non-null object
SHOT_ZONE_RANGE        1279 non-null object
SHOT_DISTANCE          1279 non-null int64
LOC_X                  1279 non-null int64
LOC_Y                  1279 non-null int64
SHOT_ATTEMPTED_FLAG    1279 non-null int64
SHOT_MADE_FLAG         1279 non-null int64
dtypes: int64(11), object(10)
memory usage: 219.8+ KB
In [ ]:
 

To print the names of the columns in the DataFrame:

In [58]:
shot_df.columns
Out[58]:
Index([          u'GRID_TYPE',             u'GAME_ID',       u'GAME_EVENT_ID',
                 u'PLAYER_ID',         u'PLAYER_NAME',             u'TEAM_ID',
                 u'TEAM_NAME',              u'PERIOD',   u'MINUTES_REMAINING',
         u'SECONDS_REMAINING',          u'EVENT_TYPE',         u'ACTION_TYPE',
                 u'SHOT_TYPE',     u'SHOT_ZONE_BASIC',      u'SHOT_ZONE_AREA',
           u'SHOT_ZONE_RANGE',       u'SHOT_DISTANCE',               u'LOC_X',
                     u'LOC_Y', u'SHOT_ATTEMPTED_FLAG',      u'SHOT_MADE_FLAG'],
      dtype='object')
In [ ]:
 

In a similar way, we can print the row labels (the index) of the DataFrame:

In [59]:
shot_df.index
Out[59]:
Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9, 
            ...
            1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278],
           dtype='int64', length=1279)
In [ ]:
 

We can also display all the columns of the first two rows (rows 0 and 1, because the end of the slice is excluded) as follows:

In [60]:
shot_df.iloc[0:2,:]
Out[60]:
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING EVENT_TYPE ACTION_TYPE SHOT_TYPE SHOT_ZONE_BASIC SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG
0 Shot Chart Detail 0021400018 4 2544 LeBron James 1610612739 Cleveland Cavaliers 1 11 20 Missed Shot Jump Shot 2PT Field Goal Mid-Range Right Side Center(RC) 16-24 ft. 18 114 148 1 0
1 Shot Chart Detail 0021400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 Made Shot Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1
In [ ]:
 

Or we can display the first two columns of the first four rows, as follows:

In [61]:
shot_df.iloc[0:4,0:2]
Out[61]:
           GRID_TYPE     GAME_ID
0  Shot Chart Detail  0021400018
1  Shot Chart Detail  0021400018
2  Shot Chart Detail  0021400018
3  Shot Chart Detail  0021400018
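The slices above use iloc, which selects by position and excludes the end of the slice, just like a Python list. Its label-based counterpart loc includes the end label. The short sketch below, on a small stand-in DataFrame, shows the difference.

```python
import pandas as pd

# Stand-in DataFrame with the default integer index 0..3.
df = pd.DataFrame({'GAME_EVENT_ID': [4, 33, 53, 77]})

# iloc is position-based: the end of the slice is excluded.
by_position = df.iloc[0:2]

# loc is label-based: the end of the slice is included.
by_label = df.loc[0:2]

print(len(by_position), len(by_label))  # 2 3
```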
In [ ]:
 

If you do not define the labels of the columns in the pandas DataFrame, the column labels default to the integers 0, 1, ..., n-1 (that is, np.arange(n)).

In [16]:
shot_df1 = pd.DataFrame(data=shots)
shot_df1.head(2)
Out[16]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 Shot Chart Detail 0021400018 4 2544 LeBron James 1610612739 Cleveland Cavaliers 1 11 20 ... Jump Shot 2PT Field Goal Mid-Range Right Side Center(RC) 16-24 ft. 18 114 148 1 0
1 Shot Chart Detail 0021400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 ... Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1

2 rows × 21 columns
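If a DataFrame was created without labels, they can still be assigned afterwards. A minimal sketch with stand-in data (three columns of the first two shots):

```python
import pandas as pd

headers = ['PLAYER_NAME', 'LOC_X', 'LOC_Y']
shots = [['LeBron James', 114, 148],
         ['LeBron James', -7, 0]]

unlabeled = pd.DataFrame(data=shots)
print(list(unlabeled.columns))  # [0, 1, 2]

# Assign the labels afterwards; the list length must match the column count.
unlabeled.columns = headers
print(list(unlabeled.columns))
```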

In [ ]:
 

If you have many columns, pandas will not show all of them. To force pandas to display all the columns, you can proceed as follows:

In [17]:
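#Setting display.max_columns to None makes pandas display any number of columns.
#Use an integer instead (for example 6) to limit the number of columns shown.
#pd.set_option('display.max_columns', 6)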
pd.set_option('display.max_columns', None)
shot_df.head(2)
Out[17]:
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING EVENT_TYPE ACTION_TYPE SHOT_TYPE SHOT_ZONE_BASIC SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG
0 Shot Chart Detail 0021400018 4 2544 LeBron James 1610612739 Cleveland Cavaliers 1 11 20 Missed Shot Jump Shot 2PT Field Goal Mid-Range Right Side Center(RC) 16-24 ft. 18 114 148 1 0
1 Shot Chart Detail 0021400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 Made Shot Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1
In [ ]:
 

Alternatively, you can change the option only temporarily with pd.option_context and render the result with the API for display objects in IPython.

In [18]:
# View the head of the DataFrame and all its columns
from IPython.display import display
with pd.option_context('display.max_columns', None):
    display(shot_df.head(2))
    
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING EVENT_TYPE ACTION_TYPE SHOT_TYPE SHOT_ZONE_BASIC SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG
0 Shot Chart Detail 0021400018 4 2544 LeBron James 1610612739 Cleveland Cavaliers 1 11 20 Missed Shot Jump Shot 2PT Field Goal Mid-Range Right Side Center(RC) 16-24 ft. 18 114 148 1 0
1 Shot Chart Detail 0021400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 Made Shot Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1
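Unlike pd.set_option, the change made by pd.option_context only lasts for the duration of the with block. The sketch below reads the option before, inside and after the block to show this.

```python
import pandas as pd

before = pd.get_option('display.max_columns')

with pd.option_context('display.max_columns', None):
    # Inside the block the column limit is lifted.
    inside = pd.get_option('display.max_columns')

# After the block the previous value is restored.
after = pd.get_option('display.max_columns')
print(before, inside, after)
```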
In [ ]:
 

If you want to save the pandas DataFrame in a csv file, you can proceed as follows:

In [19]:
shot_df.to_csv(path_or_buf='test.csv',mode='w')
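One thing to watch for when the csv file is read back later: identifiers such as GAME_ID are stored as strings with leading zeros, and the default parsing of pd.read_csv turns them into integers. The sketch below uses an in-memory buffer instead of a file and a two-row stand-in DataFrame; index=False is an optional choice here that leaves the row index out of the file.

```python
import io
import pandas as pd

df = pd.DataFrame({'GAME_ID': ['0021400018', '0021400018'],
                   'LOC_X': [114, -7]})

# Write to an in-memory buffer instead of a file;
# index=False skips the row-index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default parsing turns GAME_ID into an integer and drops the leading zeros.
buffer.seek(0)
as_int = pd.read_csv(buffer)

# Forcing the column to str keeps the identifier intact.
buffer.seek(0)
as_str = pd.read_csv(buffer, dtype={'GAME_ID': str})

print(as_int['GAME_ID'][0], as_str['GAME_ID'][0])
```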
In [ ]:
 

If you want to time the execution of a Python statement or expression, you can use the %time magic:

In [20]:
%time shot_df.to_csv(path_or_buf='test.csv',mode='w')
CPU times: user 18.4 ms, sys: 3.11 ms, total: 21.5 ms
Wall time: 425 ms
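The %time magic is only available in IPython and Jupyter. In a plain Python script, a rough equivalent for wall time can be sketched with time.perf_counter from the standard library. The helper name `timed` is hypothetical, and summing a range stands in for the to_csv call.

```python
import time


def timed(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Any callable can be timed; summing a range is used as a stand-in here.
total, seconds = timed(sum, range(100000))
print(total, 'computed in', seconds, 's')
```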
In [ ]:
 

If you want to make your code reusable, you can create functions. Let us create a function that saves the DataFrame shot_df in csv format.

In [21]:
def savecsv(name_of_file):
    shot_df.to_csv(path_or_buf=name_of_file,mode='w')
    

We can now call the function:

In [22]:
name_of_file='test1.csv'
savecsv(name_of_file)
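The function savecsv relies on the global variable shot_df. A possible variant, sketched below under the hypothetical name `save_csv`, receives the DataFrame as an argument so that it can be reused with any DataFrame. An in-memory buffer stands in for the file name to keep the example free of side effects.

```python
import io
import pandas as pd


def save_csv(df, path_or_buf):
    """Hypothetical variant of savecsv that receives the DataFrame explicitly."""
    df.to_csv(path_or_buf=path_or_buf)


sample = pd.DataFrame({'LOC_X': [114, -7], 'LOC_Y': [148, 0]})

# An in-memory buffer stands in for a file name here;
# a path such as 'test1.csv' can be passed in the same way.
buffer = io.StringIO()
save_csv(sample, buffer)
print(buffer.getvalue())
```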
In [ ]:
 

If you want to save the server data from the URL shot_chart_url in a JSON file, you can use the module json [6].

In [23]:
import json

with open('data_json.json', 'w') as outfile:
    json.dump(response.json(), outfile)

#json.dump(response.json(), open('data1_json.json', 'w'))   
In [24]:
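#This method does not work: str() gives the Python representation of the
#data, not JSON text, and obj.close without parentheses never closes the file
#obj = open('data1_json.json', 'wb')
#obj.write(str(response.json()))
#obj.close()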
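To check that a file written with json.dump can be read back, and to see why writing str(...) is not a substitute, here is a small round-trip sketch. The payload is a hypothetical stand-in for response.json(), and the file is written to a temporary directory so that data_json.json is left untouched.

```python
import json
import os
import tempfile

# Hypothetical stand-in for response.json().
payload = {'resultSets': [{'name': 'Shot_Chart_Detail',
                           'rowSet': [[114, 148, 0]]}]}

# Write to a temporary directory to avoid touching the tutorial's files.
path = os.path.join(tempfile.mkdtemp(), 'data_json.json')

with open(path, 'w') as outfile:
    json.dump(payload, outfile)

with open(path) as infile:
    restored = json.load(infile)

print(restored == payload)  # True

# str() produces a Python repr (single quotes), which is not valid JSON.
try:
    json.loads(str(payload))
    repr_is_json = True
except ValueError:
    repr_is_json = False
print(repr_is_json)  # False
```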
In [ ]:
 

By the way, do not erase the csv and json files, as we are going to use them later.

In [ ]:
 

Finally, in the blog post [7], you will find a very nice tutorial on how to create NBA shot charts. Hereafter, we are going to address and elaborate on most of the things explained in that blog post.

In [ ]:
 

In the next tutorial, we are going to do some data wrangling, data analytics and exploratory data analysis, using the DataFrame that we just created by scraping data from a web site.

In [25]:
#import sys
#print('Python version:', sys.version_info)

#import IPython
#print('IPython version:', IPython.__version__)

#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)
In [ ]: