What if you saw a table you wanted on a web page? For example: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions.
Can Python help us download those data? Why yes. Yes it can.
Specifically, we use the Pandas' read_html
function, which is able to identify tables in an HTML page and pull them out into a dataframe object.
#Import pandas
import pandas
#Install lxml if not present
try:
import lxml
except:
!pip install lxml
print("Please restart the kernel")
#Here, the read_html function pulls into a list object any table in the URL we provide.
tableList = pandas.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions')
print ("{} tables were found".format(len(tableList)))
#Let's grab the 2st table one and display it's first five rows
df = tableList[1]
df.head()
What we see here is a table with two header rows (and Pandas was smart enough to recognize that). This is a multi-index table - a powerful, but confusing format. Basically, it allows us to means to select columns:
#Show the hierarchical column names
df.columns
#Select all rows and just the columns under the "Fossil CO2 emissions(Mt CO2)" heading
df_fossil = df.loc[:,'Fossil CO2 emissions(Mt CO2)']
df_fossil.head()
#Let's add the country to our subset
df_fossil = df.loc[:,['Country[20]','Fossil CO2 emissions(Mt CO2)']]
df_fossil.head()
#We can drop the top level column indices
df_fossil = df_fossil.droplevel(level=0,axis=1)
df_fossil.head()
#Tidy up the column names
df_fossil.columns = ['Country','1990_emmissions','2005_emmissions','2017_emmissions']
#Set the country as the index
df_fossil.set_index('Country',inplace=True)
df_fossil.head()
#Now we can save it to a local file using df.to_csv()
df_fossil.to_csv("CarbonEmissions.csv", # The output filename
index=True, # We opt not to write out the index
encoding='utf8') # This deals with issues surrounding countries with odd characters
#...or we can examine it
#Here is as quick preview of pandas' plotting capability
df_fossil.iloc[3:].plot.scatter(x='1990_emmissions',y='2017_emmissions',grid=True);