04 Grabbing HTML tables with Pandas¶

What if you saw a table you wanted on a web page? For example: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions.

Can Python help us download those data? Why yes. Yes it can.

Specifically, we use the Pandas' read_html function, which is able to identify tables in an HTML page and pull them out into a dataframe object.

In [ ]:

#Import pandas
import pandas

#Install lxml if not present
try:
    import lxml
except:
    !pip install lxml
    print("Please restart the kernel")

In [ ]:

#Here, the read_html function pulls into a list object any table in the URL we provide.
tableList = pandas.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions')
print ("{} tables were found".format(len(tableList)))

In [ ]:

#Let's grab the 2st table one and display it's first five rows
df = tableList[1]
df.head()

What we see here is a table with two header rows (and Pandas was smart enough to recognize that). This is a multi-index table - a powerful, but confusing format. Basically, it allows us to means to select columns:

In [ ]:

#Show the hierarchical column names
df.columns

In [ ]:

#Select all rows and just the columns under the "Fossil CO2 emissions(Mt CO2)" heading
df_fossil = df.loc[:,'Fossil CO2 emissions(Mt CO2)']
df_fossil.head()

In [ ]:

#Let's add the country to our subset
df_fossil = df.loc[:,['Country[20]','Fossil CO2 emissions(Mt CO2)']]
df_fossil.head()

In [ ]:

#We can drop the top level column indices
df_fossil = df_fossil.droplevel(level=0,axis=1)
df_fossil.head()

In [ ]:

#Tidy up the column names
df_fossil.columns = ['Country','1990_emmissions','2005_emmissions','2017_emmissions']
#Set the country as the index
df_fossil.set_index('Country',inplace=True)
df_fossil.head()

In [ ]:

#Now we can save it to a local file using df.to_csv()
df_fossil.to_csv("CarbonEmissions.csv", # The output filename
          index=True,            # We opt not to write out the index
          encoding='utf8')       # This deals with issues surrounding countries with odd characters

In [ ]:

#...or we can examine it
#Here is as quick preview of pandas' plotting capability
df_fossil.iloc[3:].plot.scatter(x='1990_emmissions',y='2017_emmissions',grid=True);