Python Data Science with Pandas¶

Agenda:

Selecting and slicing data with loc and iloc
Selecting data by value
Selecting data from a time series by date and time values
Applying and mapping functions to pandas dataframes and series

In [ ]:

# magic command to display matplotlib plots inline within the ipython notebook webpage
%matplotlib inline

# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt

Part 1
Basics of Selecting and Slicing Data¶

In [ ]:

# Operation                       Syntax         Result
#----------------------------------------------------------
# Select column                   df[col]        Series
# Select row by label             df.loc[label]  Series
# Select row by integer location  df.iloc[loc]   Series
# Slice rows                      df[5:10]       DataFrame
# Select rows by boolean vector   df[bool_vec]   DataFrame

In [ ]:

# create a pandas dataframe from the location data set
df = pd.read_csv('data/summer-travel-gps-full.csv')
df.head()

With Python strings and lists, you can access individual elements inside the object by indexing or slicing with square brackets like my_list[0]. The same syntax works on pandas dataframes and series.

In [ ]:

# get a column from the dataframe by its label with [] indexing - returns column as a pandas series with the dataframe's index
df['city'].head()

You can slice a pandas dataframe to grab multiple columns or rows at once, by their index, just like you can slice a list or a string. Here, we just pass a list of 2 column names in:

In [ ]:

# get multiple columns by their labels by passing a list of column names within the [] operator - returns a dataframe
df[['city', 'country']].head()

To get a single "cell's" value out of a dataframe, pass a column name, then a row label. This is equivalent to slicing the dataframe down to a single series, then slicing a single value out of that series using [ ] indexing.

In [ ]:

# get a value using the [] operator for a column label then a row label
df['city'][0]

Using .loc[ ]¶

In [ ]:

# use .loc to select by row label - returns the row as a series whose index is the dataframe column labels
df.loc[0]

In [ ]:

# use .loc to select by row label and column label
df.loc[0, 'country']

In [ ]:

# 0:3 is a slice of rows with label 0 to label 3
# ['city', 'date'] is a list of column labels
df.loc[0:3, ['city', 'date']]

In [ ]:

# slice by rows and columns
df.loc[0:3, 'date':'country']

In [ ]:

# you can set values with .loc as well
print(df.loc[0, 'city'])
df.loc[0, 'city'] = 'London'
print(df.loc[0, 'city'])

Using .iloc[ ]¶

In [ ]:

# use .iloc for integer position based indexing
# get the value from the row in position 3 and the column in position 2
df.iloc[3, 2]

In [ ]:

# you can use .iloc with slices too
# slice rows from position 112 to 115 and columns from position 2 to 4
# iloc is not inclusive, so for example "from 2 to 4" will return positions 2 and 3 (but not 4)
df.iloc[112:115, 2:4]

In [ ]:

# you can use iloc to select every nth row from a data set
n = 300
df.iloc[range(0, len(df), n)]

In [ ]:

# it's easier to tell the difference between .loc and .iloc if the index labels aren't the same as their positions
df.index = [label**2 for label in df.index]
df.head()

In [ ]:

# this returns the rows with labels between 4 and 9 (.loc is inclusive)
df.loc[4:9]

In [ ]:

# this returns rows with labels in positions 4 through 8 (not through 9, because .iloc is not inclusive)
df.iloc[4:9]

Part 2
How to select rows by some value(s)¶

In [ ]:

# load a reduced set of gps data
df = pd.read_csv('data/summer-travel-gps-simplified.csv')
df.tail()

In [ ]:

# create a Series of true/false, indicating if each row in the column is equal to some value
df['city']=='Munich'

In [ ]:

# now, select only those rows in the df that match that condition
df[df['city']=='Munich']

In [ ]:

# pandas logical operators are: | for or, & for and, ~ for not
# these must be grouped by using parentheses
# what cities were visited in spain that were not barcelona?
not_bcn = df[(df['country']=='Spain') & ~(df['city']=='Barcelona')]
not_bcn

In [ ]:

not_bcn['city'].unique()

In [ ]:

# select rows where either the city is munich, or the country is serbia
df[(df['city']=='Munich') | (df['country']=='Serbia')]

In [ ]:

# how many observations are west of the prime meridian?
len(df[df['lon'] < 0])

In [ ]:

# get all rows that contain a city that starts with the letter G
row_mask = df['city'].str.startswith('G')
df[row_mask]

In [ ]:

# select all rows with certain city names by using .isin([list])
row_mask = df['city'].isin(['Munich', 'Berat', 'Maia', 'Sarajevo'])
df[row_mask]

Part 3
How to select based on a date-time values¶

In [ ]:

# load the location data set, indexed by the date field
# and, parse the dates so they're no longer strings but now rather Python datetime objects
# this lets us do date and time based operations on the data set
dt = pd.read_csv('data/summer-travel-gps-full.csv', index_col='date', parse_dates=True)
dt.head()

In [ ]:

len(dt)

In [ ]:

# 1759 rows - but is the timestamp index unique?
dt.index.is_unique

In [ ]:

# the index is not unique - drop duplicates
dt.drop_duplicates(inplace=True)
len(dt)

In [ ]:

# the .weekday attribute determines which day of the week a date is
# 0 is sunday and 6 is saturday, M-F are 1-5
# what day of the week is each datetime in our dataframe's index?
str(list(dt.index.weekday))

In [ ]:

# use bitwise OR to create a boolean vector of which rows are a weekend date
weekend_mask = (dt.index.weekday==6) | (dt.index.weekday==0)
weekend_mask

In [ ]:

weekends = dt[weekend_mask]
weekdays = dt[~weekend_mask]

In [ ]:

hourly_share = pd.DataFrame()

# calculate what share of the weekday observations each hour has
weekday_hourly = weekdays.groupby(weekdays.index.hour).size()
hourly_share['weekday'] = weekday_hourly / sum(weekday_hourly)

# calculate what share of the weekend observations each hour has
weekend_hourly = weekends.groupby(weekends.index.hour).size()
hourly_share['weekend'] = weekend_hourly / sum(weekend_hourly)

# format the x-axis ticks like 0:00 times and plot the data
hourly_share.index = [s + ':00' for s in hourly_share.index.astype(str)]
hourly_share.plot(figsize=[9, 4], kind='bar', stacked=False, alpha=0.7, title='Share of observations, by hour')

In [ ]:

# calculate and plot the number of observations each day of the week has
daily_count = dt.groupby(dt.index.weekday).size()
daily_count.index = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
daily_count.plot(figsize=[8, 5], kind='bar', title='Number of observations, by day of week')

Part 4
map(), apply(), and applymap() in pandas¶

These methods are useful for mapping/applying a function across elements, rows, and columns of a pandas DataFrame or Series. But they have some important and often confusing differences.

map() applies a function element-wise on a Series
apply() works on a row or column basis on a DataFrame (specify the axis!), or on a row basis on a Series
applymap() works element-wise on an entire DataFrame

Let's see what that means in practice with some examples.

In [ ]:

# create a new DataFrame with fake year data
df = pd.DataFrame({'start_year':[2001, 2002, 2005, 2005, 2006], 
                   'end_year':[2002, 2010, 2008, 2006, 2014]})
df

In [ ]:

# you can iterate through a DataFrame using the .iterrows() method
for _, row in df.iterrows():
    start_year = row['start_year']
    if start_year > 2004:
        print(start_year + 10)
    else:
        print(start_year)

alternatively, .map() applies a function element-wise on a Series¶

In [ ]:

# create a function
def get_new_year(x):
    if x > 2004:
        return x + 10
    else:
        return x

# then map it to the series
df['start_year'].map(get_new_year)

In [ ]:

# or do the same thing all in one line, using a lambda function as .map()'s argument... you commonly see this in pandas
df['start_year'].map(lambda x: x + 10 if x > 2004 else x)

A lambda function is a simple, one-off, anonymous function. You can't call it again later because it doesn't have a name. It just lets you repeatedly perform some operation across a series of values (in our case, a column in our dataframe) using a minimal amount of code. Also notice that the if-else statement is all on one line: this is called a ternary operator or an inline-if.

In [ ]:

# you can easily create a new column to contain the results of the function mapping
df['new_year'] = df['start_year'].map(get_new_year)
df.head()

.apply() is like .map(), but it works on a row or column basis on an entire DataFrame (specify the axis)¶

In [ ]:

# applies a function to calculate the difference between the min and max values in each column (ie, row-wise)
def get_difference(vector):
    difference = vector.max() - vector.min()
    return difference

df.apply(get_difference, axis=0)

In [ ]:

# same thing again, using a lambda function
df.apply(lambda x: x.max() - x.min(), axis=0)

In [ ]:

# here .apply() finds the difference between the min and max values in each row (ie, column-wise) and saves to a new column
df['difference'] = df.apply(get_difference, axis=1)
df

.applymap() works element-wise on an entire DataFrame¶

This is like doing a .map() to each column in the DataFrame

In [ ]:

# divide every value in the dataframe by 2 (use a float so you don't do rounded integer division)
df.applymap(lambda x: x / 2.)

In [ ]:

Python Data Science with Pandas¶

Part 1 Basics of Selecting and Slicing Data¶

Using .loc[ ]¶

Using .iloc[ ]¶

Part 2 How to select rows by some value(s)¶

Part 3 How to select based on a date-time values¶

Part 4 map(), apply(), and applymap() in pandas¶

alternatively, .map() applies a function element-wise on a Series¶

.apply() is like .map(), but it works on a row or column basis on an entire DataFrame (specify the axis)¶

.applymap() works element-wise on an entire DataFrame¶

Part 1
Basics of Selecting and Slicing Data¶

Part 2
How to select rows by some value(s)¶

Part 3
How to select based on a date-time values¶

Part 4
map(), apply(), and applymap() in pandas¶