What if you saw a table you wanted on a web page? For example: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions. Can Python help us download those data?
Why yes. Yes it can.
Specifically, we use pandas' read_html function, which identifies the tables in an HTML page and pulls each one into a dataframe object.
#Import pandas
import pandas
#We'll need a package called lxml; install it if it is not already available
try:
    import lxml
except ImportError:
    !pip install lxml
#Here, the read_html function pulls into a list object any table in the URL we provide.
the_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions'
tableList = pandas.read_html(the_url)
print("{} tables were found".format(len(tableList)))
#Let's grab the 2nd table and display its first five rows
df = tableList[1]
df.head()
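To see what read_html returns without depending on a live page, here is a minimal sketch that runs it on a small hand-written HTML string. The HTML, the table contents, and the variable names are made up for illustration; the `match` argument is an optional filter that keeps only tables whose text matches the given string. As above, this assumes lxml (or another supported parser) is installed.

```python
from io import StringIO

import pandas

# A tiny hand-written page with two tables (illustrative data, not from Wikipedia)
html = """
<html><body>
<table>
  <thead><tr><th>Country</th><th>Emissions</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>20</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>City</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>X</td><td>100</td></tr>
  </tbody>
</table>
</body></html>
"""

# read_html returns a list with one dataframe per table it finds
tables = pandas.read_html(StringIO(html))
print("{} tables were found".format(len(tables)))

# match= keeps only the tables whose text matches the given string or regex
emission_tables = pandas.read_html(StringIO(html), match="Emissions")
emissions = emission_tables[0]
print(emissions)
```

On a page with many tables, filtering with `match` can be less fragile than picking a table by its position in the list, since positions shift whenever the page is edited.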
As an aside, the resulting table has multiple column indices. MultiIndex dataframes are powerful, but also quite confusing. We'll simply drop the first header row using the droplevel() command.
#Drop the top column level (e.g. "Fossil CO2 emissions(Mt CO2)") so only the second header row remains
df_fixed = df.droplevel(
    level=0,         #drops the first header row
    axis='columns')  #tells pandas we are dealing with columns, not rows
df_fixed.head()
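The effect of droplevel() is easier to see on a small table. The sketch below builds a toy dataframe with two header rows by hand (the labels and numbers are invented, not taken from the Wikipedia table) and shows the column labels before and after dropping the top level. It also shows one alternative, selecting by a top-level label, which keeps only the columns beneath that label.

```python
import pandas

# A hand-built stand-in for a table with two header rows (made-up numbers)
columns = pandas.MultiIndex.from_tuples([
    ("Location", "Country"),
    ("Emissions", "1990"),
    ("Emissions", "2017"),
])
toy = pandas.DataFrame([["A", 1.0, 2.0], ["B", 3.0, 4.0]], columns=columns)

# Each column label is a tuple with one entry per header row
print(toy.columns.tolist())

# Dropping level 0 removes the top header row and leaves plain string labels
toy_flat = toy.droplevel(level=0, axis="columns")
print(toy_flat.columns.tolist())

# Alternative: select by a top-level label to keep only the columns beneath it
emissions_only = toy["Emissions"]
print(emissions_only.columns.tolist())
```

Note that dropping a level can leave duplicate column names if two top-level groups share a second-level label, so it is worth checking the flattened columns before selecting by name.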
#Now we can save the flattened table to a local file using to_csv()
df_fixed.to_csv("Carbon.csv",     # The output filename
                index=False,      # We opt not to write out the index
                encoding='utf8')  # Preserves non-ASCII characters in country names
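A quick way to check what was written is to read the file back. The sketch below does a round trip with a toy dataframe and a temporary file; the data values and file name are made up for illustration. It suggests that `index=False` leaves no extra index column in the file and that UTF-8 encoding keeps accented country names intact.

```python
import os
import tempfile

import pandas

# Toy data with non-ASCII country names (the numbers are invented)
toy = pandas.DataFrame({
    "Country": ["Côte d'Ivoire", "Curaçao"],
    "2017": [1.5, 2.5],
})

# Write to a temporary location rather than the working directory
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
toy.to_csv(path, index=False, encoding="utf8")

# Read the file back and compare it with what we wrote
restored = pandas.read_csv(path, encoding="utf8")
print(restored.columns.tolist())
print(restored["Country"].tolist())
```

Had the file been written with the default `index=True`, reading it back would typically show an additional unnamed column holding the old row index.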
#...or we can examine it
#Here is a quick preview of pandas' plotting capability
%matplotlib inline
df_fixed.iloc[3:,].plot.scatter(x='1990',y='2017');