Python for the Energy Industry
Pandas is a Python module designed for working with tabular data. The core data structure of pandas is the DataFrame. DataFrames share a lot in common with numpy arrays, with two main differences: the columns of a DataFrame are labelled, and each column can hold a different type of data.
A DataFrame can be created by pandas from a dictionary, where each key is a column label, and the values are the corresponding values for each 'entry'.
import pandas as pd
country_df = pd.DataFrame({
    'Country': ['China', 'US', 'Russia', 'UK'],
    'Population': [1439323776, 331002651, 145934462, 67886011],
    'HDI': [0.758, 0.920, 0.824, 0.920]
})
print(country_df)
  Country  Population    HDI
0   China  1439323776  0.758
1      US   331002651  0.920
2  Russia   145934462  0.824
3      UK    67886011  0.920
In this example, there are 4 entries representing countries, each with a corresponding population and HDI value. A particular column can be accessed in the same way as data is accessed in a dictionary:
print(country_df['Population'])
0    1439323776
1     331002651
2     145934462
3      67886011
Name: Population, dtype: int64
Or multiple columns can be accessed at once:
print(country_df[['Country','Population']])
  Country  Population
0   China  1439323776
1      US   331002651
2  Russia   145934462
3      UK    67886011
You can see that these entries have numeric indices from 0 to 3. We can also use text labels as indices instead, by setting one of the columns to be the index:
country_df.set_index('Country',inplace=True)
print(country_df)
         Population    HDI
Country
China    1439323776  0.758
US        331002651  0.920
Russia    145934462  0.824
UK         67886011  0.920
This can make it a bit easier to read data from a single column:
print(country_df['HDI'])
Country
China     0.758
US        0.920
Russia    0.824
UK        0.920
Name: HDI, dtype: float64
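Setting the index with inplace=True modifies the DataFrame directly. As a minimal sketch of an alternative, the block below rebuilds a cut-down version of the example data (the names indexed_df and restored_df are illustrative only) and shows that set_index returns a new DataFrame by default, and that reset_index turns the labels back into an ordinary column.

```python
import pandas as pd

# Cut-down copy of the example data so this block runs on its own
country_df = pd.DataFrame({
    'Country': ['China', 'US', 'Russia', 'UK'],
    'HDI': [0.758, 0.920, 0.824, 0.920]
})

# Without inplace=True, set_index returns a new DataFrame
# and leaves country_df unchanged
indexed_df = country_df.set_index('Country')
print(indexed_df.index.tolist())

# reset_index moves the index labels back into a regular column
restored_df = indexed_df.reset_index()
print(restored_df.columns.tolist())
```

Assigning the result to a new name keeps the original DataFrame available, which can be convenient when experimenting.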
We can also access all data corresponding to a single entry in the DataFrame. This can be done either with .loc and the entry's label (if text indices are being used) or with .iloc and the entry's numerical position.
# Accessing the third entry in the DataFrame
print(country_df.iloc[2])
Population    1.459345e+08
HDI           8.240000e-01
Name: Russia, dtype: float64
# Accessing the entry with the index 'China'
print(country_df.loc['China'])
Population    1.439324e+09
HDI           7.580000e-01
Name: China, dtype: float64
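The scientific notation in the two outputs above appears because a single row mixes the integer Population column with the float HDI column, so pandas returns the row as one float64 Series. The sketch below rebuilds the example data so it runs on its own, and shows one way to read a single value by giving both a row and a column to .loc or .iloc, which keeps the column's original integer type. The name china_population is illustrative only.

```python
import pandas as pd

# Rebuild the example DataFrame, indexed by country
country_df = pd.DataFrame({
    'Country': ['China', 'US', 'Russia', 'UK'],
    'Population': [1439323776, 331002651, 145934462, 67886011],
    'HDI': [0.758, 0.920, 0.824, 0.920]
}).set_index('Country')

# A whole row mixes int and float columns, so it comes back as float64
row = country_df.loc['China']
print(row.dtype)

# Selecting one cell by row label and column label keeps the integer value
china_population = country_df.loc['China', 'Population']
print(china_population)

# The same cell selected by position: first row, first column
print(country_df.iloc[0, 0])
```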
Note: we will be working with pandas a lot throughout the course. If you want to learn more about any particular feature of pandas, check out the pandas documentation.
You can use numpy arrays as a source of data when creating a DataFrame. Make a DataFrame with two columns, 'A' and 'B', each containing 10 random numbers.
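One possible approach to this exercise is sketched below. It assumes numpy's default_rng generator as the source of random numbers; the seed value and the name random_df are arbitrary choices made for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random numbers are reproducible (seed is arbitrary)
rng = np.random.default_rng(seed=0)

# Each dictionary value is a numpy array of 10 random floats in [0, 1)
random_df = pd.DataFrame({
    'A': rng.random(10),
    'B': rng.random(10),
})

print(random_df)
```

A single two-dimensional array should work equally well, for example pd.DataFrame(rng.random((10, 2)), columns=['A', 'B']).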