Module 3

Video 13: Pandas

Python for the Energy Industry

Pandas is a Python module designed for working with tabular data. The core data structure of pandas is the DataFrame. DataFrames have a lot in common with numpy arrays, with two main differences:

They can also store non-numeric data
The columns in a DataFrame are generally given text labels
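As a rough illustration of these two differences, the sketch below puts the same mixed data into a numpy array and into a DataFrame. The names `mixed_array` and `mixed_df` are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# A numpy array has one dtype for all elements, so mixing text and
# numbers converts every element to a string.
mixed_array = np.array([['China', 1439323776], ['UK', 67886011]])
print(mixed_array.dtype)

# A DataFrame stores each labelled column with its own dtype,
# so the population values stay numeric.
mixed_df = pd.DataFrame({
    'Country': ['China', 'UK'],
    'Population': [1439323776, 67886011],
})
print(mixed_df.dtypes)
```

Because the population column stays numeric in the DataFrame, it can still be used directly in calculations such as `mixed_df['Population'].sum()`.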

A DataFrame can be created from a dictionary, where each key is a column label and each value is a list holding that column's value for every 'entry' (row).

In [1]:

import pandas as pd

country_df = pd.DataFrame({
    'Country': ['China','US','Russia','UK'],
    'Population': [1439323776, 331002651, 145934462, 67886011],
    'HDI': [0.758, 0.920, 0.824, 0.920]
})

print(country_df)

  Country  Population    HDI
0   China  1439323776  0.758
1      US   331002651  0.920
2  Russia   145934462  0.824
3      UK    67886011  0.920

In this example, there are 4 entries representing countries, each with corresponding population and HDI values. A particular column can be accessed in the same way as a value is accessed in a dictionary:

In [2]:

print(country_df['Population'])

0    1439323776
1     331002651
2     145934462
3      67886011
Name: Population, dtype: int64

Or multiple columns can be accessed at once:

In [3]:

print(country_df[['Country','Population']])

  Country  Population
0   China  1439323776
1      US   331002651
2  Russia   145934462
3      UK    67886011
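One detail worth noting is that the two forms of access return different types: a single label gives a Series, while a list of labels gives a DataFrame, even if the list has only one label. The sketch below rebuilds the same data under the hypothetical name `demo_df` so that it does not interfere with `country_df`.

```python
import pandas as pd

# Same data as above, under a separate name chosen for this sketch.
demo_df = pd.DataFrame({
    'Country': ['China', 'US', 'Russia', 'UK'],
    'Population': [1439323776, 331002651, 145934462, 67886011],
    'HDI': [0.758, 0.920, 0.824, 0.920],
})

# A single label returns a one-dimensional Series.
single = demo_df['Population']
print(type(single).__name__)

# A list of labels returns a DataFrame, even with one label in the list.
double = demo_df[['Population']]
print(type(double).__name__)
print(double.shape)
```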

You can see that these entries have numeric indices from 0 to 3. We can also use text labels as indices instead, by setting one of the columns to be the index:

In [4]:

country_df.set_index('Country', inplace=True)

print(country_df)

         Population    HDI
Country                   
China    1439323776  0.758
US        331002651  0.920
Russia    145934462  0.824
UK         67886011  0.920
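As an alternative to `inplace=True`, `set_index` can be called without it, in which case it returns a new DataFrame and leaves the original unchanged. The sketch below shows this, along with `reset_index` to turn the index back into an ordinary column. The names `demo_df`, `indexed_df` and `restored_df` are chosen only for this illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Country': ['China', 'US', 'Russia', 'UK'],
    'Population': [1439323776, 331002651, 145934462, 67886011],
    'HDI': [0.758, 0.920, 0.824, 0.920],
})

# Without inplace=True, set_index returns a new DataFrame.
indexed_df = demo_df.set_index('Country')
print(indexed_df.index.tolist())

# The original DataFrame still has 'Country' as a regular column.
print(demo_df.columns.tolist())

# reset_index moves the index back into a column and restores 0..n-1 indices.
restored_df = indexed_df.reset_index()
print(restored_df.columns.tolist())
```

Which style to use is largely a matter of preference; returning a new DataFrame makes it easier to keep the original data around for comparison.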

This can make it a bit easier to read data from a single column:

In [5]:

print(country_df['HDI'])

Country
China     0.758
US        0.920
Russia    0.824
UK        0.920
Name: HDI, dtype: float64

We can also access all data corresponding to a single entry (row) in the DataFrame. This can be done either by the entry's label with .loc (if text indices are being used) or by its numerical position with .iloc.

In [6]:

# Accessing the third entry in the DataFrame
print(country_df.iloc[2])

Population    1.459345e+08
HDI           8.240000e-01
Name: Russia, dtype: float64

In [7]:

# Accessing the entry with the index 'China'
print(country_df.loc['China'])

Population    1.439324e+09
HDI           7.580000e-01
Name: China, dtype: float64
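The populations above are printed in scientific notation because a single row is returned as a Series, which has one dtype for all of its values; with an integer column and a float column, the row is converted to float64. The sketch below, using a hypothetical `demo_df` with the same data, shows this and how to read a single cell by giving both a row and a column, which keeps the value as an integer.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        'Population': [1439323776, 331002651, 145934462, 67886011],
        'HDI': [0.758, 0.920, 0.824, 0.920],
    },
    index=['China', 'US', 'Russia', 'UK'],
)

# A whole row mixes an integer and a float, so it is upcast to float64.
row = demo_df.loc['Russia']
print(row.dtype)

# Selecting one cell by row label and column label keeps the integer value.
population = demo_df.loc['Russia', 'Population']
print(population)

# The same cell by position: third row, first column.
print(demo_df.iloc[2, 0])
```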

Note: we will be working with pandas a lot throughout the course. If you want to learn more about any particular features of pandas, check out the pandas documentation.

Exercise

You can use numpy arrays as a source of data when creating a DataFrame. Make a DataFrame with two columns 'A' and 'B', each of which has 10 random numbers.
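As a hint rather than a solution, here is a minimal sketch of building a DataFrame from a numpy array of fixed (not random) values. The column labels 'X' and 'Y' and the names `values` and `hint_df` are placeholders for this example only.

```python
import numpy as np
import pandas as pd

# A 3x2 array of the integers 0..5; each array column becomes a DataFrame column.
values = np.arange(6).reshape(3, 2)

# The columns argument supplies the text labels for the array's columns.
hint_df = pd.DataFrame(values, columns=['X', 'Y'])
print(hint_df)
```

A dictionary whose values are one-dimensional numpy arrays also works, in the same way as the dictionary of lists used earlier.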

In [ ]: