Weather Changes: an Introduction to Data Analysis with Python

This is a basic introduction to real-world data analysis and visualization with Python. To see the finished product and a bit more analysis, read my blog.

The first step in data analysis is finding the right data. For this project we'll be using the Global Historical Climatology Network (GHCN) monthly data compiled by the US government. In a previous post I explain where this data comes from.

A professor of mine, and an experienced data scientist, once mentioned that "80% of the effort in machine learning is getting the source data into a manageable format." The following section simply does exactly that. :) The source files are in a custom fixed-width-column format specified on the NOAA website.

The first block here parses three *.dat files, one each for minimum, maximum and average temperature measurements. Each of these files consists of one line per station per year. A line includes the station ID, year of observation, and measurements for all 12 months.

These files are simply read in, split at the correct locations, and written out to a CSV file. This will allow us to import the data into pandas directly using pandas's read_csv functionality.
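To make the slicing concrete, here is a minimal, self-contained sketch of the approach on a single made-up record in the same layout (an 11-character station ID, a 4-digit year, a 4-character element code, then 8-column month fields whose first 5 characters hold the reading in hundredths of a degree). The record below is illustrative data, not real GHCN input:

```python
from __future__ import division
import csv, io

# Build a hypothetical fixed-width record: 9.60, 10.20, 11.80 C, scaled by 100.
months = [960, 1020, 1180]
record = "10160355000" + "1878" + "TAVG" + "".join("%5d   " % v for v in months)

station_id = record[0:11]
year = record[11:15]
# Each month occupies 8 columns starting at offset 19; the first 5 hold the value.
values = [int(record[19 + x*8 : 24 + x*8]) / 100 for x in range(len(months))]

out = io.StringIO()
csv.writer(out).writerow([station_id, year] + values)
# out now holds: 10160355000,1878,9.6,10.2,11.8
```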

In [1]:
from __future__ import division
from glob import glob
import csv,re

for name in glob('data/raw/*.dat'):
    label = name[15:19]
    print "Parsing data:",label
    with open(name) as in_file:
        with open("data/csv/"+label+".csv",'w') as out_file:
            writer = csv.writer(out_file)
            header = ["id","year"]
            header += [str(month+1)+"_"+label for month in xrange(12)]
            writer.writerow(header)

            for line in in_file:
                # if int(line[11:15]) < 2000:
                #     continue
                components = [line[0:11],line[11:15]]
                components += [int(line[19+x*8:24+x*8])/100 for x in range(12)]
                writer.writerow(components)
Parsing data: tavg
Parsing data: tmax
Parsing data: tmin

A country-codes file is provided, which maps the three-character prefix of each station ID to a country name. Because this file is small, we can read it straight into a Python dictionary of the form {[country_code]: [country_name]}, which we will use in the next step to associate countries with station IDs.
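As a sketch of the dictionary-comprehension approach, here is the same pattern applied to a hypothetical two-line excerpt in the same layout (a 3-character code, a space, then the country name); the plain str.title() stands in for the apostrophe-aware title function defined below:

```python
import io

# Made-up excerpt of a country-codes file.
cc_text = "101 ALGERIA\n403 CANADA\n"

# One dict entry per line: code -> title-cased name.
ccodes = {l[:3]: l[4:].strip().title() for l in io.StringIO(cc_text)}
# ccodes == {"101": "Algeria", "403": "Canada"}
```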

In [2]:
def title(name):
    """Title case with correct handling of internal apostrophes"""
    return re.sub("([a-z])'([A-Z])",
                  lambda m: m.group(1) + "'" + m.group(2).lower(),
                  name.title())

with open('data/raw/country-codes') as cc_file:
    ccodes = {l[:3]:title(l[4:].strip()) for l in cc_file}

Index files, ending in *.inv, correlate the station IDs in the .dat files with station metadata, such as latitude, longitude and station name. The following lines open each .inv file and convert it to a CSV, as above, using the ccodes dictionary from the previous step to look up the country name for each station.

In [3]:
for name in glob('data/raw/*.inv'):
    label = name[15:19]
    print "Parsing index:",label
    with open(name) as in_file:
        with open("data/csv/"+label+"-inv.csv",'w') as out_file:
            writer = csv.writer(out_file)
            writer.writerow(["id","lat","long","name","country"])
            for line in in_file:
                # if not line[74:79].strip().isdigit() or int(line[74:79]) < 100:
                #     continue
                name = title(line[38:68])
                name = name[:name.find("  ")].strip()
                components = [line[0:11],line[12:20],line[21:30],name,ccodes[line[:3]]]
                writer.writerow(components)
Parsing index: tavg
Parsing index: tmin
Parsing index: tmax

Aggregating the Data

Now that all the relevant data is stored as easily-parsed CSV files, we can begin processing it!

The main tool we'll be using for this is a pandas DataFrame. Think of a DataFrame as something like a table in SQL or a spreadsheet in Excel, with lots of data indexing, retrieval and aggregation tools built right in. For more information on DataFrames, see the pandas introduction. Pandas lets you build a dataframe from a dictionary, but in this case we'll be reading one right from a CSV file on disk. It looks something like this:

In [4]:
import pandas as pd

tavg_dataframe = pd.read_csv("data/csv/tavg.csv")
tavg_dataframe.head()
id year 1_tavg 2_tavg 3_tavg 4_tavg 5_tavg 6_tavg 7_tavg 8_tavg 9_tavg 10_tavg 11_tavg 12_tavg
0 10160355000 1878 9.60 10.2 11.80 16.80 20.5 23.10 25.60 27.50 23.9 -99.99 14.40 12.20
1 10160355000 1879 12.50 12.4 12.90 16.20 16.3 23.40 24.70 25.80 23.1 18.20 15.20 9.70
2 10160355000 1880 10.30 11.8 13.10 15.90 17.8 21.40 26.50 26.40 23.6 21.30 16.40 13.60
3 10160355000 1931 -99.99 10.4 -99.99 -99.99 19.2 24.60 -99.99 26.70 22.3 20.00 16.20 11.30
4 10160355000 1932 10.80 10.5 -99.99 14.90 19.1 -99.99 23.60 -99.99 25.1 -99.99 -99.99 -99.99

5 rows × 14 columns
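As mentioned above, pandas can also build a DataFrame directly from a dictionary of columns. A minimal sketch, with made-up station IDs and readings:

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
df = pd.DataFrame({
    "id":   ["ST1", "ST1", "ST2"],
    "year": [1950, 1951, 1950],
    "tavg": [9.6, 10.2, 11.8],
})
# df has 3 rows and the columns id, year, tavg.
```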

Now that we're able to open our CSV files and create dataframes, we can do some basic number crunching with them. For this project we're not interested in the month-by-month breakdown of temperatures, we just want the average for a year. So for each of our dataframes, we need to take the mean of the months. That's done with the line

typ_data[typ] = typ_data.drop(['id','year'],axis=1).mean(axis=1,skipna=True)

typ_data is a dataframe of the schema we can see above. For each row, we want to average together all the columns except year and id. That's why we call drop -- that removes those two columns for the rest of the calculation. We then take the mean of the rest of the columns, and store it back in the dataframe as typ (a stand-in for tavg, tmin or tmax). The argument axis=1 means that we want to operate across columns, instead of rows which is the default.

We then remove all columns except those we're interested in, with the syntax typ_data = typ_data[['id','year',typ]]. This assigns a new dataframe to typ_data composed only of the columns from the original included in the double brackets.
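The drop/mean/select pattern can be sketched on a tiny frame with two hypothetical monthly columns:

```python
import pandas as pd

df = pd.DataFrame({"id": ["ST1"], "year": [1950],
                   "1_tavg": [10.0], "2_tavg": [12.0]})

# Drop the non-measurement columns, then average across columns (axis=1).
df["tavg"] = df.drop(["id", "year"], axis=1).mean(axis=1, skipna=True)

# Keep only the columns of interest.
df = df[["id", "year", "tavg"]]
# df["tavg"] is now 11.0, the mean of 10.0 and 12.0.
```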

Next, we merge all three dataframes together one by one. This works just like a join in SQL, and gives you similar options. We need to make sure that all three readings from the same station and year are matched together, so we join on id and year. By default, this performs an inner join, so only stations and years with all three types of readings are included in the final dataframe.
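The inner-join behavior can be seen on two hypothetical frames; station ST2 has no tmin reading, so the merge drops it:

```python
import pandas as pd

tavg = pd.DataFrame({"id": ["ST1", "ST2"], "year": [1950, 1950],
                     "tavg": [9.6, 11.8]})
tmin = pd.DataFrame({"id": ["ST1"], "year": [1950], "tmin": [4.2]})

# Inner join on ['year', 'id']: only rows present in both frames survive.
merged = pd.merge(tavg, tmin, on=["year", "id"])
# merged contains a single row, for station ST1 in 1950.
```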

Finally, we merge the completed dataframe with the metadata index, giving us a latitude, longitude, station name and country for each data reading.

In [5]:
import numpy as np

data = None

for typ in ["tavg","tmin","tmax"]:
    # -99.99 is used in this dataset to represent missing data, so we replace it with NaN (not a number)
    # This keeps it from being used in calculations.
    typ_data = pd.read_csv("data/csv/"+typ+".csv").replace(-99.99,np.NaN)
    typ_data[typ] = typ_data.drop(['id','year'],axis=1).mean(axis=1,skipna=True)
    typ_data = typ_data[['id','year',typ]]
    if data is None:
        data = typ_data
    else:
        data = pd.merge(data, typ_data, on=['year','id'])

index = pd.read_csv("data/csv/tavg-inv.csv")
data = pd.merge(data, index, on=['id'])
data = data.dropna()

id year tavg tmin tmax lat long name country
0 10160355000 1996 19.370000 15.860000 24.200000 36.93 6.95 Skikda Algeria
1 10160355000 1997 18.790000 14.942857 21.828571 36.93 6.95 Skikda Algeria
2 10160355000 1998 18.550000 14.730000 22.460000 36.93 6.95 Skikda Algeria
3 10160355000 1999 19.536364 12.914286 20.600000 36.93 6.95 Skikda Algeria
4 10160355000 2000 19.070000 16.111111 23.933333 36.93 6.95 Skikda Algeria

5 rows × 9 columns

How Many Stations are There, Anyway?

Just for fun, let's take a quick look at the stations we've got.

This plot shows the number of stations with reported readings by year. That big drop in the 90's is interesting!

To produce this, we just need to combine our dataframe with some simple plotting code from matplotlib. All we need for a simple plot like this are two lists: "x" and "y" values. For the "x" values, we simply enumerate all of the years with available data. This is done with data['year'].min() and data['year'].max() -- the brackets in this case select the year column, and .min() and .max() perform the expected aggregation functions.

We then count the number of data points for each year, by calling data[data.year == year]['id'].count(). This selects all of the rows whose year matches the year in question, and then counts the entries in one column. A simple query. :)

The matplotlib code is fairly self-explanatory -- we just pass in the two lists, label the axes, and show the plot.

In [6]:
import matplotlib.pyplot as plt
%matplotlib inline

years = xrange(data['year'].min(),data['year'].max()+1)
counts = [data[data.year == year]['id'].count() for year in years]

plt.plot(years, counts)
plt.xlabel('Year'); plt.ylabel('Stations Online')
plt.show()

Finding Averages

The next section finds the average temperature for each station for the periods of 1950-1954 and 2008-2012. Most of these operations we've seen before, but we do introduce two new ones -- queries with multiple selectors and groupby.

early_years = data[(data.year >= 1950) & (data.year < 1955)][['id','tavg']]

This line uses simple ampersand syntax (&) to show that we want to make two selections on the data -- the year should be greater than or equal to 1950, and less than 1955. We also select only the id and tavg columns from the result.

early_years = early_years.groupby('id').mean()

We then use the powerful groupby syntax to find the mean of all the years in the collection that have the same station ID. This gives us the average temperature over the whole range for each station.
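Both steps can be sketched on a small frame of made-up readings; two ST1 rows fall inside the window and one does not:

```python
import pandas as pd

df = pd.DataFrame({"id":   ["ST1", "ST1", "ST1", "ST2"],
                   "year": [1950, 1954, 1980, 1951],
                   "tavg": [9.0, 11.0, 14.0, 20.0]})

# Two boolean conditions joined with &, then select two columns.
early = df[(df.year >= 1950) & (df.year < 1955)][["id", "tavg"]]

# Average together all selected rows with the same station ID.
early = early.groupby("id").mean()
# early now holds one mean tavg per station: ST1 -> 10.0, ST2 -> 20.0.
```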

Finally, we use a simple subtraction operator to find the difference in averages. dropna is called to remove stations that do not have data available in both ranges.
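The subtraction aligns on the index (here, station IDs), so a station missing from either side produces NaN, which dropna then removes. A toy illustration with hypothetical averages:

```python
import pandas as pd

early = pd.Series({"ST1": 10.0, "ST2": 20.0})
late  = pd.Series({"ST1": 11.5, "ST3": 30.0})

# ST2 and ST3 each appear in only one Series, so their deltas are NaN.
deltas = (late - early).dropna()
# deltas contains a single entry: ST1 -> 1.5.
```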

In [7]:
from datetime import datetime

data = data[(data.year >= 1950) & (data.year <= 2013)].copy()
data['date'] = data['year'].apply(lambda x: datetime(x,1,1))

early_years = data[(data.year >= 1950) & (data.year < 1955)][['id','tavg']]
early_years = early_years.groupby('id').mean()

late_years = data[(data.year >= 2008) & (data.year < 2013)][['id','tavg']]
late_years = late_years.groupby('id').mean()

deltas = (late_years['tavg']-early_years['tavg']).dropna()

We combine the evaluated deltas with the index using the same square brackets to assign a column.

The last line groups by country, and counts the occurrences of each. 1222 of the 1686 stations represented are in the USA.

In [8]:
change = index.copy().set_index('id')
change['delta'] = deltas
change = change.dropna()
print "Total stations represented:",len(change)
change['country'].value_counts()
Total stations represented: 1686
United States Of America 1222
Russian Federation (Asian Sector) 71
China 52
Australia 49
Japan 46
Canada 34
Russian Federation (European Sector) 19
Pakistan 17
Sudan 15
Italy 15
Philippines 14
Thailand 13
Malaysia 11
Ukraine 10
Poland 10
South Africa 9
Republic Of Korea 9
Algeria 6
Turkmenistan 6
Uzbekistan 6
Kazakhstan 5
Zambia 3
Belarus 3
Egypt 3
Federated States Of Micronesia 3
Bangladesh 3
Zimbabwe 2
Libya 2
Lithuania 2
Tanzania 2
Nigeria 2
Cocos Islands (Australia) 1
Antarctica 1
Armenia 1
Wake Island (U.S.A.) 1
Austria 1
Saudi Arabia 1
Belau 1
Bermuda (U.K.) 1
Portugal 1
Northern Mariana Islands (U.S.A.) 1
Latvia 1
Kyrgyzstan 1
Cuba 1
Czech Republic 1
Norfolk Island (Australia) 1
Estonia 1
Ghana 1
Iceland 1
Indonesia 1
Moldova 1
Marshall Islands 1
Coral Sea Islands (Australia) 1

53 rows × 1 columns

The describe method gives you standard summary statistics for a dataset.

In [9]:
print change['delta'].describe()
count    1686.000000
mean        0.570878
std         1.711249
min       -14.261167
25%         0.051986
50%         0.661217
75%         1.250096
max         9.831667
Name: delta, dtype: float64

matplotlib is great for quick analysis and visualizations. Let's plot a histogram of our temperature deltas.

In [10]:
plt.hist(change['delta'])
plt.title('Average Temperature Change, 1950-2010')
plt.xlabel('Delta Temperature (C)')
plt.ylabel('Count of Stations')
plt.show()

Mapping the Data

The final step is mapping the data. There are three major steps to this process: (1) Setting the color of each marker, (2) creating the temperature over time graph for each marker, and (3) plotting the markers on the map.

Coloring the Markers

We want to assign a color, blue or red, to each marker depending on its change over the period. The first step is to discretize the deltas into 7 equal-width bins. Because most stations had a change of less than +/-5° C, we will use that as the limits of our scale. The first two lines find any changes of greater than 5° and cap them. After that, the cut method is used to discretize the delta column into 7 bins. I chose a 7-tone blue-to-red scale from ColorBrewer to represent the colors of the markers, which will be assigned based on the discretized delta.

In [11]:
change.loc[change['delta'] > 5, 'delta'] = 5
change.loc[change['delta'] < -5, 'delta'] = -5

change['cat'] = pd.cut(change['delta'], 7)
cats = change.sort('delta')['cat'].unique()
colors = ["#2166ac","#67a9cf","#d1e5f0","#f7f7f7","#fddbc7","#ef8a62","#b2182b"]
cat_colors = {cat:color for cat, color in zip(cats,colors)}
list(cats)
['(-5.01, -3.571]',
 '(-3.571, -2.143]',
 '(-2.143, -0.714]',
 '(-0.714, 0.714]',
 '(0.714, 2.143]',
 '(2.143, 3.571]',
 '(3.571, 5]']
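As a toy illustration of the same two steps on made-up deltas (here using Series.clip as a compact equivalent of the two capping lines above):

```python
import pandas as pd

# Cap values outside +/-5, then discretize into 7 equal-width bins.
s = pd.Series([-7.0, -1.0, 0.5, 6.0]).clip(-5, 5)
cats = pd.cut(s, 7)
# cats.cat.categories holds the 7 interval bins spanning [-5, 5].
```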

Creating the Charts

Visualization is done with folium for the map and vincent for charts. These both have simple and well-documented APIs.

To create the graph for each station, we first select the readings that pertain to the station of interest, using the familiar bracket syntax for selection. The next line does several things.

weather_frame = weather_frame.rename(columns={'tmax':'Max','tmin':'Min','tavg':'Average'}).set_index('date')

We first have to rename the columns to human-friendly names like Max, Min and Average. Vincent will use these column names in its legend. Next, we set the dataframe index to date. This lets vincent know that we want a time-series graph with the date on the X axis.
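The rename/set_index step in isolation, on a hypothetical frame of yearly readings for one station:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2000-01-01", "2001-01-01"]),
                   "tmax": [20.0, 21.5], "tmin": [8.0, 9.0]})

# Human-friendly column names for the legend, then date as the index.
df = df.rename(columns={"tmax": "Max", "tmin": "Min"}).set_index("date")
# df now has columns Max and Min, indexed by date.
```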

Vincent doesn't need any more configuration than that, so we can simply create a line graph with chart = vincent.Line(weather_frame). The rest of the code performs fairly standard formatting of the chart, and is all found in the vincent API linked above.

In [12]:
import folium, vincent

def popup_chart(station_id):
    weather_frame = data[data.id == station_id][['date','tmax','tavg','tmin']]
    weather_frame = weather_frame.rename(columns={'tmax':'Max','tmin':'Min','tavg':'Average'}).set_index('date')
    station = index[index.id == station_id].iloc[0] # Find the station in the index

    chart = vincent.Line(weather_frame)
    chart.axis_titles(x=station['name']+", "+station["country"],y="Temp (C)")
    chart.title=station['name']+", "+station["country"]

    return chart    

chart = popup_chart(42572734000)

Putting it All Together: the Map

With folium, creating the map itself is fairly simple! The configuration options, such as the default zoom level and tile set to use, are found in the folium docs.

We then iterate through all our stations, and add a marker for each one. The popup is set by generating a vincent chart in the above manner for each data point. The fill color is set with the previously generated cat_colors dictionary. We save the final map to disk, and we are good to go!

(Note: the final map won't show up in the online ipython notebook viewer because it relies on Javascript. However, you can see it here.)

In [13]:
f_map = folium.Map(
    tiles="Mapbox Bright")

for station_id,station in change.iterrows():