Finding Heavy Traffic Indicators on I-94


In this project, we're going to analyze a dataset about the westbound traffic on the I-94 Interstate highway connecting the Great Lakes and northern Great Plains regions of the U.S. The dataset was made available by John Hogue and can be downloaded from this repository.

The goal of our analysis is to determine a few indicators of heavy traffic on I-94, such as weather type, day of the week, hour, etc.

Summary of Results

We found that traffic is heaviest in the daytime, in warm months, and on business days, especially at 6.00-8.00 and 16.00-17.00. Temperature doesn't influence traffic intensity, while some relatively light weather conditions do. The lowest average traffic volume was recorded in 2016, followed by the highest in 2017. Of all the holidays, Columbus Day has the heaviest traffic, while Christmas Day and New Year have the lightest.

Dataset Downloading and Initial Analysis

In [1]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head()
   holiday    temp  rain_1h  snow_1h  clouds_all  weather_main  weather_description            date_time  traffic_volume
0     None  288.28      0.0      0.0          40        Clouds     scattered clouds  2012-10-02 09:00:00            5545
1     None  289.36      0.0      0.0          75        Clouds        broken clouds  2012-10-02 10:00:00            4516
2     None  289.58      0.0      0.0          90        Clouds      overcast clouds  2012-10-02 11:00:00            4767
3     None  290.13      0.0      0.0          90        Clouds      overcast clouds  2012-10-02 12:00:00            5026
4     None  291.14      0.0      0.0          75        Clouds        broken clouds  2012-10-02 13:00:00            4918
In [2]:
traffic.tail()
       holiday    temp  rain_1h  snow_1h  clouds_all  weather_main     weather_description            date_time  traffic_volume
48199     None  283.45      0.0      0.0          75        Clouds           broken clouds  2018-09-30 19:00:00            3543
48200     None  282.76      0.0      0.0          90        Clouds         overcast clouds  2018-09-30 20:00:00            2781
48201     None  282.73      0.0      0.0          90  Thunderstorm  proximity thunderstorm  2018-09-30 21:00:00            2159
48202     None  282.09      0.0      0.0          90        Clouds         overcast clouds  2018-09-30 22:00:00            1450
48203     None  282.12      0.0      0.0          90        Clouds         overcast clouds  2018-09-30 23:00:00             954
In [3]:
print(f'\033[1mNumber of rows:\033[0m\t\t  {traffic.shape[0]:,}'
      f'\n\033[1mNumber of columns:\033[0m \t  {traffic.shape[1]:,}'
      f'\n\033[1mNumber of missing values:\033[0m {traffic.isnull().sum().sum()}'
      f'\n\n\033[1mCOLUMN NAMES:\033[0m \n{traffic.columns.to_list()}'
      f'\n\n\033[1mDATA TYPES:\033[0m \n{traffic.dtypes}')
Number of rows:		  48,204
Number of columns: 	  9
Number of missing values: 0

COLUMN NAMES:
['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main', 'weather_description', 'date_time', 'traffic_volume']

DATA TYPES:
holiday                 object
temp                   float64
rain_1h                float64
snow_1h                float64
clouds_all               int64
weather_main            object
weather_description     object
date_time               object
traffic_volume           int64
dtype: object

There are 48,204 entries and 9 columns in the dataset, and no NaN values. The column holiday seems to contain a lot of 'None' values, though. Below are the descriptions of each column from the documentation:

  • holiday – US national holidays plus a regional holiday (Minnesota State Fair),
  • temp – average temperature in kelvins,
  • rain_1h – the amount of rain that fell in the hour, in mm,
  • snow_1h – the amount of snow that fell in the hour, in mm,
  • clouds_all – % of cloud cover,
  • weather_main – a short textual description of the current weather,
  • weather_description – a longer textual description of the current weather,
  • date_time – the hour of data collection, in local CST time,
  • traffic_volume – hourly I-94 ATR 301 reported westbound traffic volume.
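
The two remarks above (many 'None' values in holiday, temperature stored in kelvins) can be checked with a couple of lines. The sketch below runs on a small hand-made frame rather than the real file, so the rows and the temp_c column name are illustrative assumptions only. Depending on the pandas version, read_csv may also parse the literal string 'None' as a missing value, in which case the count of missing values would differ from the 0 reported above.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
sample = pd.DataFrame({
    'holiday': ['None', 'None', 'Columbus Day', 'None'],
    'temp': [288.28, 289.36, 273.15, 291.14],
})

# How often each holiday label occurs, and the share of non-holiday rows
holiday_counts = sample['holiday'].value_counts()
none_share = holiday_counts['None'] / len(sample)

# Kelvins to degrees Celsius (temp_c is a hypothetical helper column)
sample['temp_c'] = sample['temp'] - 273.15

print(holiday_counts)
print(f'Share of non-holiday rows: {none_share:.0%}')
```

On the real data, the same calls would be applied to traffic instead of sample.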

Analyzing Traffic Volume

The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records cars moving from east to west.


This means that the results of our analysis will be about the westbound traffic in the proximity of that station, so we should avoid generalizing our results for the entire I-94 highway.

Let's plot the distribution of traffic volume:

In [4]:

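def create_hist(df, bins, color, title, title_font, axis_font, tick_font):
    """Plot a histogram of the traffic_volume column of df."""
    plt.hist(df['traffic_volume'], bins=bins, color=color)
    plt.title(title, fontsize=title_font)
    plt.xlabel('Traffic volume, cars/hr', fontsize=axis_font)
    plt.ylabel('Frequency', fontsize=axis_font)
    # Apply the tick label font size on both axes
    plt.xticks(fontsize=tick_font)
    plt.yticks(fontsize=tick_font)

# Plotting the overall distribution of traffic volume
create_hist(df=traffic, bins=24, color='slateblue',
            title='Traffic volume distribution',
            title_font=30, axis_font=20, tick_font=14)
plt.show()

# Summary statistics of traffic volume, rounded to whole cars
print(traffic['traffic_volume'].describe().round())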
count    48204.0
mean      3260.0
std       1987.0
min          0.0
25%       1193.0
50%       3380.0
75%       4933.0
max       7280.0
Name: traffic_volume, dtype: float64

Some observations here:

  • The distribution is multimodal and slightly right-skewed, with 3 peaks at approximately 500, 3,000, and 5,000 cars/hr.
  • Overall, the number of cars per hour varies from 0 to 7,280, with an average of 3,260 cars/hr.
  • The most frequent values of traffic volume range from 300 to 1,000, from 2,500 to 3,500, and from 4,500 to 6,300.
  • About 25% of the time, 1,193 or fewer cars were passing the station each hour.
  • About 25% of the time, 4,933 or more cars were passing the station each hour.
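
The quartile statements above come straight from the 25%, 50%, and 75% rows of describe(). As a minimal sketch of the same calculation, the snippet below computes the quartiles of a made-up series (the numbers are illustrative, not taken from the dataset) with Series.quantile:

```python
import pandas as pd

# Made-up hourly counts standing in for traffic['traffic_volume']
volume = pd.Series([0, 400, 800, 1200, 3000, 3400, 4800, 5000, 7200])

# Lower quartile, median, and upper quartile in one call
q1, median, q3 = volume.quantile([0.25, 0.5, 0.75])

print(f'25% of the values are at or below {q1:,.0f}')
print(f'50% of the values are at or below {median:,.0f}')
print(f'25% of the values are at or above {q3:,.0f}')
```

Applied to the real column, the three values should match the describe() output shown earlier.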

At this point, we can assume that traffic volume depends strongly on the time of day. In particular, the daytime most probably includes distinct periods of moderate traffic (during working hours) and heavy traffic with traffic jams (in the morning and evening, when people usually commute to or from work). So, let's compare daytime with nighttime data.

Traffic Volume: Day vs. Night

We'll start by dividing the dataset into two parts:

  • Daytime data: from 7.00 to 19.00.
  • Nighttime data: from 19.00 to 7.00.

While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.

In [5]:
traffic['date_time'] = pd.to_datetime(traffic['date_time'])

# Dividing the dataset into daytime and nighttime subsets
hour = traffic['date_time'].dt.hour
day = traffic[(hour >= 7) & (hour < 19)].copy()
night = traffic[(hour < 7) | (hour >= 19)].copy()
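
As a sanity check on the split, it helps to confirm that the two subsets do not overlap and together cover every row, and that the boundary hours land where intended. The sketch below does this on a tiny hand-made frame; the rows and the names sample, day_part, and night_part are assumptions for illustration.

```python
import pandas as pd

# Four made-up rows sitting right on the day/night boundaries
sample = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-02 06:00:00', '2012-10-02 07:00:00',
                                 '2012-10-02 18:00:00', '2012-10-02 19:00:00']),
    'traffic_volume': [5000, 6000, 4500, 3000],
})

hour = sample['date_time'].dt.hour
is_day = hour.between(7, 18)  # inclusive on both ends, i.e. 7 <= hour < 19

day_part = sample[is_day].copy()
night_part = sample[~is_day].copy()

# Every row belongs to exactly one of the two subsets
print(len(day_part) + len(night_part) == len(sample))
print('Day hours:  ', day_part['date_time'].dt.hour.tolist())
print('Night hours:', night_part['date_time'].dt.hour.tolist())
```

Using a single boolean mask and its negation guarantees that the two parts form a partition of the data.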

Now, we're going to compare traffic volume at night and during the day:

In [6]:
dfs = [day, night]
titles = ['Traffic volume: Day', 'Traffic volume: Night']
colors = ['deepskyblue', 'darkblue']

# Plotting the distributions of traffic volume at night and during the day
for i, (df, title, color) in enumerate(zip(dfs, titles, colors), start=1):
    plt.subplot(1, 2, i)
    create_hist(df=df, bins=20, color=color, title=title,
                title_font=22, axis_font=16, tick_font=11)
plt.show()

day_stats = day['traffic_volume'].describe().round()
night_stats = night['traffic_volume'].describe().round()

print(f'\033[1mDAYTIME STATS:\033[0m\n{day_stats}'
      f'\n\n\033[1mNIGHTTIME STATS:\033[0m\n{night_stats}')
DAYTIME STATS:
count    23877.0
mean      4762.0
std       1175.0
min          0.0
25%       4252.0
50%       4820.0
75%       5559.0
max       7280.0
Name: traffic_volume, dtype: float64

NIGHTTIME STATS:
count    24327.0
mean      1785.0
std       1442.0
min          0.0
25%        530.0
50%       1287.0
75%       2819.0
max       6386.0
Name: traffic_volume, dtype: float64

Let's analyze the features for daytime and nighttime separately.

Daytime:

  • The distribution for daytime is unimodal and left-skewed.
  • The majority of traffic volume values lie between 4,000 and 6,500 cars/hr, with an average of 4,762 cars/hr.
  • About 25% of the daytime, traffic volume was 4,252 cars/hr or lower.
  • On the other hand, about 25% of the daytime, it was 5,559 cars/hr or higher.

Nighttime:

  • The distribution for nighttime is multimodal and right-skewed, with 3 peaks at approximately 500 (the highest peak), 3,000, and 5,500 (the lowest peak) cars/hr. This shape accounts for the 3-peak distribution of the whole dataset that we saw earlier, in particular for its two leftmost peaks.
  • The majority of traffic volume values lie between 0 and 1,000 cars/hr, and, secondarily, between 2,500 and 3,500 cars/hr, with an average of 1,785 cars/hr, which is 2.7 times lower than that for daytime.
  • About 25% of the nighttime, the traffic volume was 530 cars/hr or lower.
  • On the other hand, about 25% of the nighttime, it was 2,819 cars/hr or higher.
  • The multimodal shape is most probably due to the fact that what we treated as "night" is in fact quite heterogeneous: from 19.00 till 7.00, we actually have:
    • two short periods of time prone to traffic jams, when people go to work or back home,
    • a period of very light traffic at deep night,
    • something in between, including the early morning time.

All in all, from the histograms above, we can conclude that in general, the night traffic is much less intense than that of daytime. Since our goal is to find indicators of heavy traffic, let's focus on the daytime.
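
The size of the gap between day and night can be read directly from the describe() outputs above. A minimal sketch, with the rounded statistics copied in by hand (the variable names are illustrative):

```python
# Rounded statistics copied from the describe() outputs above
day_mean, night_mean = 4762, 1785
day_q1 = 4252    # 25th percentile of daytime traffic
night_q3 = 2819  # 75th percentile of nighttime traffic

# How many times heavier the average daytime traffic is
ratio = day_mean / night_mean
print(f'Daytime mean is {ratio:.1f} times the nighttime mean')

# If the upper quartile at night is below the lower quartile in the daytime,
# at least 75% of night hours are lighter than at least 75% of day hours
print(night_q3 < day_q1)
```

Both checks point the same way: the two distributions overlap only in their tails.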

Time Indicators

One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain hour.

Let's check how traffic volume changed by different time units: year, month, day of the week, or hour.
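
All of these time units can be derived from date_time with the .dt accessor and then used as grouping keys. The sketch below shows the pattern on four made-up rows; the data and the helper column names (month, dayofweek, hour) are assumptions for illustration.

```python
import pandas as pd

# Four made-up observations: two Mondays and two Saturdays
sample = pd.DataFrame({
    'date_time': pd.to_datetime(['2016-01-04 08:00:00', '2016-01-09 08:00:00',
                                 '2017-07-03 16:00:00', '2017-07-08 16:00:00']),
    'traffic_volume': [5000, 3000, 6000, 4000],
})

# Derive one column per time unit
sample['year'] = sample['date_time'].dt.year
sample['month'] = sample['date_time'].dt.month
sample['dayofweek'] = sample['date_time'].dt.dayofweek  # 0 = Monday, 6 = Sunday
sample['hour'] = sample['date_time'].dt.hour

# Average traffic volume per time unit
by_year = sample.groupby('year')['traffic_volume'].mean()
by_dayofweek = sample.groupby('dayofweek')['traffic_volume'].mean()

print(by_year)
print(by_dayofweek)
```

Selecting the traffic_volume column before calling mean() avoids averaging columns for which a mean is not meaningful.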

Traffic Volume by Year

In [7]:
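day['year'] = day['date_time'].dt.year
# numeric_only=True skips text columns such as weather_main,
# which recent pandas versions refuse to average
by_year = day.groupby('year').mean(numeric_only=True)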

def create_line_plot(df, title, xlabel, tick_min=None, tick_max=None, labels=None,
                     xmin=None, xmax=None, ymin=None, ymax=None, 
                     title_font=30, color='slateblue', 
                     linestyle=None, marker=None, rotation=None):
    plt.plot(df['traffic_volume'], color=color, linestyle=linestyle, marker=marker)
    plt.title(title, fontsize=title_font)
    plt.xlabel(xlabel, fontsize=20)
    plt.ylabel('Traffic volume, cars/hr', fontsize=20)
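    # Use explicit integer ticks only when a tick range is given;
    # otherwise keep matplotlib's default tick positions
    ticks = list(range(tick_min, tick_max + 1)) if tick_max is not None else None
    plt.xticks(ticks=ticks, labels=labels, fontsize=15, rotation=rotation)
    # Axis limits stay unchanged when the corresponding argument is None
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)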

# Plotting traffic volume by year
create_line_plot(df=by_year,
                 title='Traffic volume by year', xlabel='Year')