Pandas Fundamentals

This material is adapted from the Earth and Environmental Data Science, from Ryan Abernathey (Columbia University).

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
# %config InlineBackend.figure_format = 'retina'
plt.ion()  # To trigger the interactive inline mode

Pandas Data Structures: Series

A Series represents a one-dimensional array of data. The main difference between a Series and numpy array is that a Series has an index. The index contains the labels that we use to access the data.

There are many ways to create a Series. We will just show a few.

In [11]:
names = ['Yasi', 'Debbie', 'Yasa']
values = [5., 4., 5.]
cyclones = pd.Series(values, index=names)
cyclones
Out[11]:
Yasi      5.0
Debbie    4.0
Yasa      5.0
dtype: float64

Series have built in plotting methods.

In [12]:
cyclones.plot(kind='bar')
Out[12]:
<AxesSubplot:>

Arithmetic operations and most numpy function can be applied to Series. An important point is that the Series keep their index during such operations.

In [13]:
np.log(cyclones) / cyclones**2
Out[13]:
Yasi      0.064378
Debbie    0.086643
Yasa      0.064378
dtype: float64

We can access the underlying index object if we need to:

In [14]:
cyclones.index
Out[14]:
Index(['Yasi', 'Debbie', 'Yasa'], dtype='object')

Indexing

We can get values back out using the index via the .loc attribute

In [15]:
cyclones.loc['Debbie']
Out[15]:
4.0

Or by raw position using .iloc

In [16]:
cyclones.iloc[1]
Out[16]:
4.0

We can pass a list or array to loc to get multiple rows back:

In [17]:
cyclones.loc[['Debbie', 'Yasa']]
Out[17]:
Debbie    4.0
Yasa      5.0
dtype: float64

And we can even use slice notation

In [21]:
cyclones.loc['Debbie':'Yasa']
Out[21]:
Debbie    4.0
Yasa      5.0
dtype: float64
In [20]:
cyclones.iloc[:2]
Out[20]:
Yasi      5.0
Debbie    4.0
dtype: float64

If we need to, we can always get the raw data back out as well

In [22]:
cyclones.values # a numpy array
Out[22]:
array([5., 4., 5.])
In [23]:
cyclones.index # a pandas Index object
Out[23]:
Index(['Yasi', 'Debbie', 'Yasa'], dtype='object')

Pandas Data Structures: DataFrame

There is a lot more to Series, but they are limit to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index. It's a lot like a table in a spreadsheet.

Below we create a DataFrame.

In [24]:
# first we create a dictionary
data = {'category': [5, 4, 5],
        'windspeed': [250., 215., 260.],
        'fatalities': [1, 14, 4]}
df = pd.DataFrame(data, index=['Yasi', 'Debbie', 'Yasa'])
df
Out[24]:
category windspeed fatalities
Yasi 5 250.0 1
Debbie 4 215.0 14
Yasa 5 260.0 4

Pandas handles missing data very elegantly, keeping track of it through all calculations.

In [25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Yasi to Yasa
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   category    3 non-null      int64  
 1   windspeed   3 non-null      float64
 2   fatalities  3 non-null      int64  
dtypes: float64(1), int64(2)
memory usage: 96.0+ bytes

A wide range of statistical functions are available on both Series and DataFrames.

In [26]:
df.min()
Out[26]:
category        4.0
windspeed     215.0
fatalities      1.0
dtype: float64
In [27]:
df.mean()
Out[27]:
category        4.666667
windspeed     241.666667
fatalities      6.333333
dtype: float64
In [28]:
df.std()
Out[28]:
category       0.577350
windspeed     23.629078
fatalities     6.806859
dtype: float64
In [29]:
df.describe()
Out[29]:
category windspeed fatalities
count 3.000000 3.000000 3.000000
mean 4.666667 241.666667 6.333333
std 0.577350 23.629078 6.806859
min 4.000000 215.000000 1.000000
25% 4.500000 232.500000 2.500000
50% 5.000000 250.000000 4.000000
75% 5.000000 255.000000 9.000000
max 5.000000 260.000000 14.000000

We can get a single column as a Series using python's getitem syntax on the DataFrame object.

In [30]:
df['windspeed']
Out[30]:
Yasi      250.0
Debbie    215.0
Yasa      260.0
Name: windspeed, dtype: float64

...or using attribute syntax.

In [31]:
df.windspeed
Out[31]:
Yasi      250.0
Debbie    215.0
Yasa      260.0
Name: windspeed, dtype: float64

Indexing works very similar to series

In [32]:
df.loc['Debbie']
Out[32]:
category        4.0
windspeed     215.0
fatalities     14.0
Name: Debbie, dtype: float64
In [33]:
df.iloc[1]
Out[33]:
category        4.0
windspeed     215.0
fatalities     14.0
Name: Debbie, dtype: float64

But we can also specify the column we want to access

In [34]:
df.loc['Debbie', 'fatalities']
Out[34]:
14
In [35]:
df.iloc[:2, 2]
Out[35]:
Yasi       1
Debbie    14
Name: fatalities, dtype: int64

If we make a calculation using columns from the DataFrame, it will keep the same index:

In [36]:
df.windspeed / df.category
Out[36]:
Yasi      50.00
Debbie    53.75
Yasa      52.00
dtype: float64

Which we can easily add as another column to the DataFrame:

In [37]:
df['ratio'] = df.windspeed / df.category
df
Out[37]:
category windspeed fatalities ratio
Yasi 5 250.0 1 50.00
Debbie 4 215.0 14 53.75
Yasa 5 260.0 4 52.00

Merging Data

Pandas supports a wide range of methods for merging different datasets. These are described extensively in the documentation. Here we just give a few examples.

In [44]:
cyclone = pd.Series(['$246.7 M', None, '$2.73 B', '$3.6 B'],
                     index=['Yasa', 'Tino', 'Debbie', 'Yasi'],
                     name='cost')
cyclone
Out[44]:
Yasa      $246.7 M
Tino          None
Debbie     $2.73 B
Yasi        $3.6 B
Name: cost, dtype: object
In [45]:
# returns a new DataFrame
df.join(cyclone)
Out[45]:
category windspeed fatalities ratio cost
Yasi 5 250.0 1 50.00 $3.6 B
Debbie 4 215.0 14 53.75 $2.73 B
Yasa 5 260.0 4 52.00 $246.7 M
In [46]:
# returns a new DataFrame
df.join(cyclone, how='right')
Out[46]:
category windspeed fatalities ratio cost
Yasa 5.0 260.0 4.0 52.00 $246.7 M
Tino NaN NaN NaN NaN None
Debbie 4.0 215.0 14.0 53.75 $2.73 B
Yasi 5.0 250.0 1.0 50.00 $3.6 B
In [47]:
# returns a new DataFrame
cyc = df.reindex(['Yasa', 'Tino', 'Debbie', 'Yasi', 'Tracy'])
cyc
Out[47]:
category windspeed fatalities ratio
Yasa 5.0 260.0 4.0 52.00
Tino NaN NaN NaN NaN
Debbie 4.0 215.0 14.0 53.75
Yasi 5.0 250.0 1.0 50.00
Tracy NaN NaN NaN NaN

We can also index using a boolean series. This is very useful

In [50]:
huge = df[df.category > 4]
huge
Out[50]:
category windspeed fatalities ratio is_tragic
Yasi 5 250.0 1 50.0 True
Yasa 5 260.0 4 52.0 True
In [51]:
df['is_huge'] = df.category > 4
df
Out[51]:
category windspeed fatalities ratio is_tragic is_huge
Yasi 5 250.0 1 50.00 True True
Debbie 4 215.0 14 53.75 True False
Yasa 5 260.0 4 52.00 True True

Modifying Values

We often want to modify values in a dataframe based on some rule. To modify values, we need to use .loc or .iloc

In [52]:
df.loc['Debbie', 'windspeed'] = 285
df.loc['Yasa', 'category'] += 1
df
Out[52]:
category windspeed fatalities ratio is_tragic is_huge
Yasi 5 250.0 1 50.00 True True
Debbie 4 285.0 14 53.75 True False
Yasa 6 260.0 4 52.00 True True

Plotting

DataFrames have all kinds of useful plotting built in.

In [53]:
df.plot(kind='scatter', x='category', y='windspeed', grid=True)
Out[53]:
<AxesSubplot:xlabel='category', ylabel='windspeed'>
In [54]:
df.plot(kind='bar')
Out[54]:
<AxesSubplot:>

Time Indexes

Indexes are very powerful. They are a big part of why Pandas is so useful. There are different indices for different types of data. Time Indexes are especially great!

In [38]:
two_years = pd.date_range(start='2014-01-01', end='2016-01-01', freq='D')
timeseries = pd.Series(np.sin(2 *np.pi *two_years.dayofyear / 365),
                       index=two_years)
timeseries.plot()
Out[38]:
<AxesSubplot:>

We can use python's slicing notation inside .loc to select a date range.

In [39]:
timeseries.loc['2015-01-01':'2015-07-01'].plot()
Out[39]:
<AxesSubplot:>

The TimeIndex object has lots of useful attributes

In [40]:
timeseries.index.month
Out[40]:
Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            12, 12, 12, 12, 12, 12, 12, 12, 12,  1],
           dtype='int64', length=731)
In [41]:
timeseries.index.day
Out[41]:
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
            ...
            23, 24, 25, 26, 27, 28, 29, 30, 31,  1],
           dtype='int64', length=731)

Reading Data Files

In this example, we will use the eReefs Extraction tool from AIMS and look at Davies Reef which is located at 18.8S/147.6E.

The CSV file was downloaded from the AIMS extraction tool with the following settings:

  • Data collection: GBR1 Hydro (Version 2)
  • Variables: Eastward wind speed (wspeed_u), Northward wind speed (wspeed_v), Northward current (v), Eastward current (u)
  • Date range: 2010-2021
  • Time step: hourly Depths: -2.35 m

Once the extraction request was submitted the dataset was created after an one hour of processing the data was available for download from Extraction request: Example dataset: Wind-vs-Current at Davies and Myrmidon Reefs (2019).

After download we have a file called DaviesReef_timeseries.csv.

:::{tip} To read it into pandas, we will use the read_csv function. This function is incredibly complex and powerful. You can use it to extract data from almost any text file. However, you need to understand how to use its various options. :::

With no options, this is what we get.

In [56]:
df = pd.read_csv('examples/DaviesReef_timeseries.csv')
df.head()
Out[56]:
date current wind salinity temperature
0 2010-09-01 0.18839505 7.481102 35.47361 25.17717
1 2010-09-02 0.28618774 4.9959917 35.48525 25.26...
2 2010-09-03 0.29367417 6.024843 35.462696 25.20...
3 2010-09-04 0.28546846 5.7049136 35.44163 25.38...
4 2010-09-05 0.34016508 4.3658934 35.41293 25.7196

Pandas failed to identify the different columns. This is because it was expecting standard CSV (comma-separated values) file. In our file, instead, the values are separated by whitespace. And not a single whilespace--the amount of whitespace between values varies. We can tell pandas this using the sep keyword.

In [57]:
df = pd.read_csv('examples/DaviesReef_timeseries.csv', sep='\s+')
df.head()
Out[57]:
date current wind salinity temperature
0 2010-09-01 0.188395 7.481102 35.473610 25.177170
1 2010-09-02 0.286188 4.995992 35.485250 25.262072
2 2010-09-03 0.293674 6.024843 35.462696 25.201515
3 2010-09-04 0.285468 5.704914 35.441630 25.385157
4 2010-09-05 0.340165 4.365893 35.412930 25.719600

Great! It worked.

Often missing values are set either to -9999 or -99. Let's tell this to pandas by assigning nan to these specific values.

In [61]:
df = pd.read_csv('examples/DaviesReef_timeseries.csv', sep='\s+', na_values=[-9999.0, -99.0])
df.head()
Out[61]:
date current wind salinity temperature
0 2010-09-01 0.188395 7.481102 35.473610 25.177170
1 2010-09-02 0.286188 4.995992 35.485250 25.262072
2 2010-09-03 0.293674 6.024843 35.462696 25.201515
3 2010-09-04 0.285468 5.704914 35.441630 25.385157
4 2010-09-05 0.340165 4.365893 35.412930 25.719600

Great. The missing data is now represented by NaN.

What data types did pandas infer?

In [62]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3815 entries, 0 to 3814
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         3815 non-null   object 
 1   current      3815 non-null   float64
 2   wind         3815 non-null   float64
 3   salinity     3815 non-null   float64
 4   temperature  3815 non-null   float64
dtypes: float64(4), object(1)
memory usage: 149.1+ KB

One problem here is that Pandas did not recognize the date column as a date. Let's help it.

In [63]:
df = pd.read_csv('examples/DaviesReef_timeseries.csv', sep='\s+',
                 na_values=[-9999.0, -99.0],
                 parse_dates=['date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3815 entries, 0 to 3814
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         3815 non-null   datetime64[ns]
 1   current      3815 non-null   float64       
 2   wind         3815 non-null   float64       
 3   salinity     3815 non-null   float64       
 4   temperature  3815 non-null   float64       
dtypes: datetime64[ns](1), float64(4)
memory usage: 149.1 KB

It worked! Finally, let's tell pandas to use the date column as the index.

In [64]:
df = df.set_index('date')
df.head()
Out[64]:
current wind salinity temperature
date
2010-09-01 0.188395 7.481102 35.473610 25.177170
2010-09-02 0.286188 4.995992 35.485250 25.262072
2010-09-03 0.293674 6.024843 35.462696 25.201515
2010-09-04 0.285468 5.704914 35.441630 25.385157
2010-09-05 0.340165 4.365893 35.412930 25.719600

We can now access values by time:

In [65]:
df.loc['2014-08-07']
Out[65]:
current         0.381858
wind           11.198019
salinity       35.307660
temperature    23.866120
Name: 2014-08-07 00:00:00, dtype: float64

Or use slicing to get a range:

In [73]:
df2014 = df.loc['2014-01-01':'2014-12-31']
df2016 = df.loc['2016-01-01':'2016-12-31']
df2018 = df.loc['2018-01-01':'2018-12-31']
df2020 = df.loc['2020-01-01':'2020-12-31']
df2014
Out[73]:
current wind salinity temperature
date
2014-01-01 0.400576 5.939211 35.446510 28.987250
2014-01-02 0.382042 3.748479 35.437440 29.117306
2014-01-03 0.396745 4.321397 35.470460 29.362583
2014-01-04 0.438115 5.537916 35.474777 29.393948
2014-01-05 0.440387 5.804630 35.441086 29.072742
... ... ... ... ...
2014-12-27 0.229368 5.385712 35.412727 28.865137
2014-12-28 0.205974 3.610898 35.416306 29.263445
2014-12-29 0.303206 3.092528 35.432480 29.667050
2014-12-30 0.254966 2.479212 35.479332 30.116217
2014-12-31 0.285224 2.497501 35.500930 30.287010

365 rows × 4 columns

Quick Statistics

In [74]:
df.describe()
Out[74]:
current wind salinity temperature
count 3815.000000 3815.000000 3815.000000 3815.000000
mean 0.265826 6.708479 35.182442 26.610127
std 0.084196 2.643628 0.292870 1.936466
min 0.061780 1.245048 32.956253 22.933332
25% 0.209833 4.613128 35.042687 24.853140
50% 0.256591 6.528472 35.263058 26.700006
75% 0.308909 8.581003 35.375860 28.180752
max 0.817050 18.900612 35.763687 31.598990

Plotting Values

We can now quickly make plots of the data

In [77]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(14,14))

df2014.iloc[:, :].boxplot(ax=ax[0,0])
df2016.iloc[:, :].boxplot(ax=ax[0,1])
df2018.iloc[:, :].boxplot(ax=ax[1,0])
df2020.iloc[:, :].boxplot(ax=ax[1,1])
Out[77]:
<AxesSubplot:>

Pandas is very "time aware":

In [78]:
df.temperature.plot()
Out[78]:
<AxesSubplot:xlabel='date'>

Note: we could also manually create an axis and plot into it.

In [79]:
fig, ax = plt.subplots()
df2018.temperature.plot(ax=ax)
ax.set_title('Pandas Made This!')
Out[79]:
Text(0.5, 1.0, 'Pandas Made This!')
In [80]:
df2020[['temperature', 'wind', 'salinity']].plot()
Out[80]:
<AxesSubplot:xlabel='date'>

Resampling

Since pandas understands time, we can use it to do resampling.

In [81]:
# monthly reampler object
rs_obj = df.resample('MS')
rs_obj
Out[81]:
<pandas.core.resample.DatetimeIndexResampler object at 0x7f9ac0daf100>
In [82]:
rs_obj.mean()
Out[82]:
current wind salinity temperature
date
2010-09-01 0.281347 5.869099 35.365671 26.017173
2010-10-01 0.283938 8.118046 35.368278 26.372507
2010-11-01 0.268329 7.007014 35.384552 26.872662
2010-12-01 0.342128 5.501357 35.019846 29.165949
2011-01-01 0.306163 5.960649 34.620039 29.236425
... ... ... ... ...
2020-10-01 0.257407 6.045590 35.242590 25.873325
2020-11-01 0.268678 6.381081 35.325458 27.001825
2020-12-01 0.288298 5.469824 35.251699 28.221937
2021-01-01 0.289114 7.052687 34.391954 28.522526
2021-02-01 0.236967 3.992934 34.692678 29.264261

126 rows × 4 columns

We can chain all of that together

In [83]:
df_mm = df.resample('MS').mean()
df_mm[['temperature', 'wind', 'salinity']].plot()
Out[83]:
<AxesSubplot:xlabel='date'>

Next notebook, we will dig deeper into resampling, rolling means, and grouping operations (groupby).

In [ ]: