04 Grabbing HTML tables with Pandas

What if you saw a table you wanted on a web page? For example: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions. Can Python help us download those data?

Why yes. Yes it can.

Specifically, we use pandas' read_html function, which identifies the tables in an HTML page and returns them as a list of dataframe objects.

In [ ]:

#Import pandas
import pandas

In [ ]:

#We'll need a package called lxml; install it if it's not already available
try:
    import lxml
except ImportError:
    !pip install lxml

In [ ]:

#Here, the read_html function pulls every table found at the URL we provide into a list of dataframes.
the_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions'
tableList = pandas.read_html(the_url)
print("{} tables were found".format(len(tableList)))
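When a page holds many tables, it can be easier to ask read_html for only the ones containing certain text, using its match argument. Here is a minimal sketch of the idea: it parses a small inline HTML snippet instead of a live URL, and the table contents ("Country A", "Fuel", and so on) are made up purely for illustration.

```python
import io
import pandas

# A tiny, made-up HTML page with two tables (illustrative data only)
html = """
<table>
  <tr><th>Country</th><th>1990</th><th>2017</th></tr>
  <tr><td>Country A</td><td>10.0</td><td>12.5</td></tr>
  <tr><td>Country B</td><td>3.2</td><td>2.9</td></tr>
</table>
<table>
  <tr><th>Fuel</th><th>Share</th></tr>
  <tr><td>Coal</td><td>40</td></tr>
</table>
"""

# Without match, every table on the page comes back in a list
all_tables = pandas.read_html(io.StringIO(html))

# With match, only tables containing the given text (or regex) are returned
fuel_tables = pandas.read_html(io.StringIO(html), match="Fuel")

print("{} tables in total, {} matching 'Fuel'".format(len(all_tables), len(fuel_tables)))
```

The same match argument should work with the_url above, provided you pick a word that appears only in the table you are after.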

In [ ]:

#Let's grab the 2nd table and display its first five rows
df = tableList[1]
df.head()

As an aside, the resulting table has multiple levels of column headers. MultiIndex dataframes are powerful, but also quite confusing. We'll simply drop the first header row using the droplevel() method.

In [ ]:

#Drop the top column level ("Fossil CO2 emissions(Mt CO2)"), keeping only the second header row
df_fixed = df.droplevel(
    level=0,         #drops the first header row
    axis='columns')  #tells pandas we are dealing with columns, not rows
df_fixed.head()
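To see what droplevel() is doing, it may help to try it on a small table you build yourself. The sketch below uses a made-up two-level header (the names "Group" and "Emissions" and all values are hypothetical, not taken from the Wikipedia page) and also shows the alternative of selecting everything under one top-level header.

```python
import pandas

# Build a toy dataframe with two header rows (illustrative names and values)
columns = pandas.MultiIndex.from_tuples([
    ("Group", "Country"),
    ("Emissions", "1990"),
    ("Emissions", "2017"),
])
toy = pandas.DataFrame(
    [["Country A", 10.0, 12.5],
     ["Country B", 3.2, 2.9]],
    columns=columns)

# Dropping level 0 removes the top header row and leaves a flat set of names
flat = toy.droplevel(level=0, axis='columns')
print(list(flat.columns))

# Alternatively, indexing by a top-level name keeps only the columns beneath it
emissions_only = toy["Emissions"]
print(list(emissions_only.columns))
```

Dropping a level keeps every column, so it only gives unique names when the lower-level labels do not repeat across groups; selecting by top-level name is the safer choice when they do.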

In [ ]:

#Now we can save the cleaned-up table to a local file using to_csv()
df_fixed.to_csv("Carbon.csv",    # The output filename
                index=False,     # We opt not to write out the index
                encoding='utf8') # This handles country names containing non-ASCII characters
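A quick way to convince yourself the encoding choice works is to write a file and read it straight back. This is a small sketch using a throwaway dataframe and a temporary folder; the "value" column holds placeholder numbers, not real emissions data.

```python
import os
import tempfile
import pandas

# A toy dataframe whose names include non-ASCII characters (placeholder values)
toy = pandas.DataFrame({
    "Country": ["Côte d'Ivoire", "Curaçao"],
    "value": [1, 2],
})

# Write to a temporary folder so nothing is left behind afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False, encoding='utf8')
    # Read the file back with the same encoding
    restored = pandas.read_csv(path, encoding='utf8')

print(restored)
```

If the accented names print correctly, the same settings should be fine for Carbon.csv.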

In [ ]:

#...or we can examine it
#Here is a quick preview of pandas' plotting capability
%matplotlib inline
df_fixed.iloc[3:,].plot.scatter(x='1990',y='2017');