import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore') # ignores annoying warnings
In the past few lectures we have learned about several different ways to filter data, from masking in NumPy arrays to filter( ) in the last lecture. Today we are going to learn about filtering Pandas DataFrames to accomplish some of the same tasks, but with more flexibility and ease. Pandas provides a powerful way to tease out useful information.
We will be looking at data on Holocene Eruptions from the Smithsonian Holocene Volcano Database: http://volcano.si.edu/list_volcano_holocene.cfm (Global Volcanism Program, 2013. Volcanoes of the World, v. 4.8.7. Venzke, E (ed.). Smithsonian Institution. Downloaded 14 Mar 2020. https://doi.org/10.5479/si.GVP.VOTW4-2013). This link will download an xml file, which Excel can read, but for Pandas to work on it, you have to convert it to a regular .xls file from within Excel. I have done that for you for the following.
We will see how to filter these data to pull out interesting information on Holocene Eruptions. Let's read in this data and first look at its length. To read an Excel spreadsheet, we can use a new function pd.read_excel( ). If you look in the file, you will see that the header is in the second line (not the first), so header=1 in the argument list, after the PATH to the file.
EruptionData=pd.read_excel('Datasets/GVP_Volcano_List_Holocene.xls',header=1)
print('Number of Eruptions:',len(EruptionData))
EruptionData.head()
Wow, that's a lot of Eruptions! However, the DataFrame has a lot of information we really aren't interested in. For example, there are many eruptions in this data for which the evidence isn't strong ('Evidence Uncertain'). You can verify this by printing out the Series "Activity Evidence".
EruptionData['Activity Evidence']
In Lecture 9, we learned how to filter a DataFrame by putting what we wanted in a conditional statement enclosed in square brackets. Remembering from Lecture 4 that the conditional for "equal to" is "==", we can retrieve all the rows that contain 'Eruption Observed' in the column 'Activity Evidence' like this:
#notice the conditional '==' which means 'equals to' from Lecture 4
ObservedEruptions=EruptionData[EruptionData['Activity Evidence']=='Eruption Observed']
ObservedEruptions.head()
Pandas DataFrames also have a method called .loc that allows for filtering of DataFrames in a similar way to the familiar conditional above.
EruptionData.loc[EruptionData['Activity Evidence']=='Eruption Observed'].head()
This statement does exactly the same thing as the conditional. The syntax of a .loc statement might look trickier, but trust us, it will make your life easier as things get more complicated. It is computationally faster and has more tricks up its sleeve, as we shall see soon. :)
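To convince yourself that the two forms are equivalent, here is a minimal sketch using a made-up toy DataFrame (the column values are hypothetical, just to illustrate the syntax):

```python
import pandas as pd

# toy DataFrame with made-up values, standing in for the real eruption data
df = pd.DataFrame({'Evidence': ['Eruption Observed', 'Evidence Uncertain', 'Eruption Observed'],
                   'Name': ['A', 'B', 'C']})

# plain boolean-mask filtering
mask_result = df[df['Evidence'] == 'Eruption Observed']
# the same filter written with .loc
loc_result = df.loc[df['Evidence'] == 'Eruption Observed']

# both return exactly the same rows
print(mask_result.equals(loc_result))  # True
```

Both expressions select rows 'A' and 'C'; the payoff of .loc comes later, when we start selecting columns at the same time.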
Now let's look at some big eruptions we might be interested in (and who wouldn't be?). One of the most famous Volcanic eruptions is the 1980 Eruption of Mount St. Helens (Washington State). To find it, let's search for Holocene Eruptions of Mount St. Helens.
Image(filename='Figures/StHelens.jpg')
Image from: Global Volcanism Program, 2013. St. Helens (321050) in Volcanoes of the World, v. 4.7.5. Venzke, E (ed.). Smithsonian Institution. Downloaded 31 Dec 2018 (https://volcano.si.edu/volcano.cfm?vn=321050)
EruptionData.loc[EruptionData['Volcano Name']=='St. Helens']
As we can see, simple conditional statements like this enable us to filter large datasets for the small amount of information we're interested in.
Although the above statement would work equally well without the .loc method, we can add some bells and whistles. The .loc syntax allows you to search through a particular column (Series) by putting a comma after your conditional statement, followed by another Series name. Say we wanted the 'Last Known Eruption' of all stratovolcanoes. We could do this:
stratos=EruptionData.loc[EruptionData['Primary Volcano Type'].str.contains('Stratovolcano'),'Last Known Eruption']
stratos.head()
Here we have used the syntax DataFrame['Series'].str.contains( ). This allows us to get not only the type "Stratovolcano", but also "Stratovolcano(es)" and anything else that has "Stratovolcano" in it.
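Here is a self-contained sketch of this substring matching on a toy Series of hypothetical volcano types:

```python
import pandas as pd

# hypothetical volcano types, including the "(es)" variant from the real data
types = pd.Series(['Stratovolcano', 'Stratovolcano(es)', 'Shield', 'Caldera'])

# substring match picks up both stratovolcano variants
is_strato = types.str.contains('Stratovolcano')
print(types[is_strato].tolist())  # ['Stratovolcano', 'Stratovolcano(es)']
```

Note that an exact comparison with == 'Stratovolcano' would have missed the "Stratovolcano(es)" entries.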
It is worth pointing out another way to accomplish the same thing using the method isin(). You can create a list of things you want (or don't want), then test if the string is in the list. Here is how it would work for this example to select things in the list:
volcano_types=['Stratovolcano','Stratovolcano(es)']
stratos=EruptionData[EruptionData['Primary Volcano Type'].isin(volcano_types)]
stratos.head()
And here's how it works if you don't want things in the list:
volcano_types=['Stratovolcano','Stratovolcano(es)']
not_stratos=EruptionData[EruptionData['Primary Volcano Type'].isin(volcano_types)==False]
not_stratos.head()
Moving on.
Now we can do stuff to this filtered DataFrame stratos. The .loc syntax also allows you to take a slice through the columns list to select a specific range of column headers:
ColumnSlice=EruptionData.loc[EruptionData['Primary Volcano Type'].str.contains('Stratovolcano'),
'Volcano Name':'Last Known Eruption']
ColumnSlice.head()
Something else .loc can do is to change values in place in DataFrames easily. Let's say we found a historical document that told us that the Unknown last eruption at Panarea was in 5000 BCE. We want to update the information in the DataFrame, and can do it this way:
print ('before modifying:\n',EruptionData.loc[EruptionData['Volcano Name']=='Panarea']['Last Known Eruption'])
EruptionData.loc[EruptionData['Volcano Name']=='Panarea','Last Known Eruption']='5000 BCE'
# and let's take a look:
print ('after modifying:\n',EruptionData.loc[EruptionData['Volcano Name']=='Panarea'])
As we can see, the syntax for this can get complicated quickly, but we can retrieve and/or modify lots of data using a few lines of code.
What if we wanted to sort our dataset so the most northerly eruptions come out on top? Pandas DataFrames have a method for this called sort_values. Normally, this will sort from lowest to highest (an "ascending" sort), but we can use the argument ascending=False to tell it to sort from highest to lowest.
# First read in our DataFrame again:
EruptionData=pd.read_excel('Datasets/GVP_Volcano_List_Holocene.xls',header=1)
NorthernToSouthern=EruptionData.sort_values(by='Latitude',ascending=False)
NorthernToSouthern.head()
Looks like the most northerly eruptions during the Holocene were at the East Gakkel Ridge and that the most northerly above sea-level eruption was on Jan Mayen in 1985. I bet I'm the only person you know who has actually been there!
Now let's try to get the first 10 rows in this DataFrame. We can do this using .loc, right?
NorthernToSouthern.loc[0:10]
Oops! This didn't work as expected, did it? Instead, we got all the rows with index labels between 0 and 10, which are no longer in any particular order. When we sorted by Latitude, Pandas kept the original indices rather than assigning new ones, so the index labels are now scrambled. This is a "feature" of sorting functions. So... to get what we really wanted, which was the first 10 records in the NorthernToSouthern DataFrame, we can use the method .iloc instead of .loc.
NorthernToSouthern.iloc[0:10]
Much better. Now we can see that there were lots of eruptions during the Holocene on Iceland. Can you think of how to get names of all of the Icelandic volcanoes that erupted during the Holocene?
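The label-versus-position distinction is easiest to see on a toy DataFrame (made-up latitudes) where sorting scrambles the index:

```python
import pandas as pd

# toy DataFrame; rows get index labels 0, 1, 2 on creation
df = pd.DataFrame({'Latitude': [10, 80, 45]})

# sorting by Latitude reorders the rows but KEEPS the original labels: 1, 2, 0
sorted_df = df.sort_values(by='Latitude', ascending=False)

# .loc selects by index LABEL: label 0 is now the LAST row
print(sorted_df.loc[0, 'Latitude'])   # 10
# .iloc selects by POSITION: position 0 is the FIRST row after sorting
print(sorted_df.iloc[0]['Latitude'])  # 80
```

So after a sort, .iloc gives you "the top N rows" while .loc still answers "the rows whose labels are these".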
But we can solve our indexing problem with .loc by re-indexing our sorted DataFrame. One way to re-index a Pandas DataFrame is with the .set_index( ) method.
This will set the index to a list of values from 0 up to (but not including) the length of the DataFrame.
# make a list of integers between zero up to (but not including) the length of the DataFrame
newIndexValues=list(range(len(NorthernToSouthern)))
# reset the indices to this list
NorthernToSouthern=NorthernToSouthern.set_index([newIndexValues])
NorthernToSouthern.head()
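As an aside, Pandas also has a .reset_index( ) method that does this renumbering in one step; with drop=True it discards the old scrambled labels instead of keeping them as a new column. A minimal sketch on a toy DataFrame:

```python
import pandas as pd

# toy "sorted" DataFrame whose index labels are scrambled (as after sort_values)
df = pd.DataFrame({'Latitude': [80, 45, 10]}, index=[1, 2, 0])

# reset_index(drop=True) renumbers the rows 0..n-1 and throws away the old labels
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1, 2]
```

After this, .loc[0:10] and .iloc[0:10] would agree again.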
Another thing about indices: We can set the indices to one of the other column names, for example the "Volcano Name".
NorthernToSouthern=NorthernToSouthern.set_index('Volcano Name')
NorthernToSouthern.head()
In this example data set, the 'Last Known Eruption' values are strings like '5000 BCE' or '1985 CE' (or 'unknown'), so we cannot sort numerically on that column as is. But we can first drop all the rows where the 'Last Known Eruption' is 'unknown', then split the dates on the space, multiply all the dates with 'BCE' in them by -1, and sort by the resulting column.
Let's do that step by step:
EruptionData=pd.read_excel('Datasets/GVP_Volcano_List_Holocene.xls',header=1) # read this in again
KnownEruptionDates=EruptionData[EruptionData['Last Known Eruption'].str.contains('unknown')==False]
To see if this worked, we can look at all the unique eruption dates and see if 'unknown' is still there:
KnownEruptionDates[KnownEruptionDates['Last Known Eruption'].str.contains('unknown')]
Nope. So now step 2:
KnownEruptionDates['Last Known Eruption'].str.split().head()
Whoa! Some of the entries are 'Unknown' and not 'unknown'. Sloppy! But we can handle that:
KnownEruptionDates=KnownEruptionDates[KnownEruptionDates['Last Known Eruption'].str.contains('Unknown')==False]
More on splitting: it turns out that whitespace is the default separator for split. We don't really want to do this, but if we DID, we could split on the 'C' like so:
KnownEruptionDates['Last Known Eruption'].str.split('C').head()
See how the .str.split( ) method returns a list, with the stuff before the split key as the first element and the stuff after it as the second.
We can make two arrays, one for the date and one for the 'CE' or 'BCE' tag. We first make an array (using the .values attribute), transpose the array, and assign the first row to a DataFrame column named 'date' and the second row to a column named 'CE/BCE'.
dates=KnownEruptionDates['Last Known Eruption'].str.split(' ',expand=True).transpose()
print (dates.values)
# put in the 'date' column:
KnownEruptionDates['date']=dates.values[0].astype('int')
# put in the 'CE/BCE' column:
KnownEruptionDates['CE/BCE']=dates.values[1].astype('str')
Because Pandas read in the 'Last Known Eruption' as a string, we need to convert the first part to an integer (that is the .astype('int') bit above). This means we can multiply the date in records with 'BCE' by -1.
KnownEruptionDates.loc[KnownEruptionDates['CE/BCE'].str.contains('BCE'),'date']=-1*KnownEruptionDates['date']
KnownEruptionDates.sort_values(by='date').head()
We've been working on a data set that only had Holocene data in it, but the same Smithsonian website has a data set for Pleistocene volcanic eruptions too. We can concatenate both data sets into a single DataFrame (as long as they have the same columns) like this:
HoloceneEruptionData=pd.read_excel('Datasets/GVP_Volcano_List_Holocene.xls',header=1) # read this in again
PleistoceneEruptionData=pd.read_excel('Datasets/GVP_Volcano_List_Pleistocene.xls',header=1) # read this in again
RecentEruptionData=pd.concat([HoloceneEruptionData,PleistoceneEruptionData])
RecentEruptionData.head()
RecentEruptionData.tail()
So a lot less is known about the Pleistocene eruptions than the Holocene ones.
There is another data set on the Smithsonian Website that is interesting. It has a list of all the currently active volcanoes.
ActiveVolcanoes=pd.read_excel('Datasets/ActiveVolcanoes.xlsx')
ActiveVolcanoes.head()
There's a lot more information about each volcano in our HoloceneEruptionData DataFrame, which we could attach using the Pandas merge( ) method. There are many ways to use merge( ), but the key idea is to identify what kind of join you want:
Image(filename='Figures/join-types.jpg')
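To make the join types concrete, here is a toy sketch (the volcano names and countries are made up) contrasting a left join, which keeps every row of the left DataFrame, with an inner join, which keeps only rows matched in both:

```python
import pandas as pd

# hypothetical "active volcanoes" list and "details" table sharing some names
active = pd.DataFrame({'Volcano': ['Etna', 'Stromboli', 'Mystery']})
details = pd.DataFrame({'Volcano Name': ['Etna', 'Stromboli'],
                        'Country': ['Italy', 'Italy']})

# left join: every row of `active` survives; unmatched rows get NaN details
left = active.merge(details, how='left', left_on='Volcano', right_on='Volcano Name')
print(len(left))   # 3  ('Mystery' kept, with NaN for Country)

# inner join: only volcanoes present in BOTH tables survive
inner = active.merge(details, how='inner', left_on='Volcano', right_on='Volcano Name')
print(len(inner))  # 2
```

We will use how='left' below because we want to keep every active volcano, whether or not we have extra information about it.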
We want to pair all the stuff from the Holocene volcanoes database with the ActiveVolcanoes data if it isn't already there ('Country' for example is in both).
MergedVolcanoes=ActiveVolcanoes.merge(KnownEruptionDates,how='left',left_on='Volcano',right_on='Volcano Name')
MergedVolcanoes.head()
You see that if a particular column (say 'Country') is in both DataFrames, the first one gets renamed 'Country_x' and the other one 'Country_y'. To clean that up we can rename the first to 'Country', then delete (drop) the second one out of the DataFrame. We also want to delete the column 'Volcano Name' because it is redundant.
MergedVolcanoes.rename(columns={'Country_x':'Country'},inplace=True)
MergedVolcanoes.drop(['Country_y','Volcano Name'],axis=1,inplace=True)
MergedVolcanoes.head()
Looks like it worked.
Now that we have more information, we can start classifying these eruptions by type. For example, what tectonic settings are represented in this dataset? Pandas has a method called .unique( ) that allows us to find all the unique values in a column.
list(MergedVolcanoes['Tectonic Setting'].unique())
This tells us some useful information, including that some of the values are not a number (or NaN in Pandish). We can get rid of these using the method .dropna( ). While we are at it, we could also delete the rows with no 'Volcanic Explosivity Index' (Max VEI) data.
MergedVolcanoes.dropna(subset=['Tectonic Setting'],inplace=True)
MergedVolcanoes.head()
# inplace=True does the method 'in place' so we don't have to assign it to a new DataFrame
Pandas has a couple more methods that might be useful for looking at the distribution of these data. These are the .groupby( ) and .describe( ) methods. We can use these methods to look at the typical volcano types for each tectonic setting in our dataset.
.groupby( ) groups things in your DataFrame by unique values in a Series, for example grouping everything by 'Tectonic Setting'. .describe( ) summarizes some useful statistics. So if we wanted to know basic statistics for each tectonic setting (and who wouldn't?), we would do:
MergedVolcanoes.groupby('Tectonic Setting')['Primary Volcano Type'].describe()
This tells us, for each tectonic setting, how many active volcanoes there are ('count') and the most common volcano type ('top'). For example, of the roughly 15 active shallow subduction zone volcanoes, all 15 are stratovolcanoes.
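Since the full dataset isn't reproduced here, here is a self-contained miniature of the groupby/describe pattern on made-up rows, showing where 'count' and 'top' come from:

```python
import pandas as pd

# made-up miniature version of the merged volcano table
df = pd.DataFrame({'Tectonic Setting': ['Subduction', 'Subduction', 'Rift'],
                   'Primary Volcano Type': ['Stratovolcano', 'Stratovolcano', 'Shield']})

# .describe() on a string column reports count, unique, top, and freq per group
summary = df.groupby('Tectonic Setting')['Primary Volcano Type'].describe()

print(summary.loc['Subduction', 'top'])    # Stratovolcano (most common type in the group)
print(summary.loc['Subduction', 'count'])  # 2 (rows in the group)
```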
Create a new notebook with the name format: Lastname_Initial_HomeworkNumber. For example, Cych_B_HW_4
Create a markdown block in which you will describe what the notebook does.