The Integrated Crisis Early Warning System (ICEWS) is a machine-coded event dataset developed by Lockheed Martin and others for DARPA and the Office of Naval Research. For a long time, ICEWS was available only within the Department of Defense, and to a few select academics. Now, for the first time, a checkpointed version of ICEWS is being released to the general public (or, at least, the parts of the general public that care about political event data).
Unlike some event datasets, the public version of ICEWS will only be updated annually or so, but it still includes almost 20 years' worth of event data that has been used successfully in both government and academic research.
This document is mostly a cleaned-up version of my own initial exploration of the dataset. Hopefully it'll prove useful to others who want to use ICEWS in their own research.
UPDATE (03/29/15): Jennifer Lautenschlager, from the ICEWS team at Lockheed Martin, was kind enough to provide some clarifications, which I've added.
This is done in Python 3.4.2, with pandas version 0.15.2. The only requirement that might be tricky to install is Basemap, which is only used for the mapping section. You won't miss much without it.
import os
from collections import defaultdict
# Other libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Show plots inline
%matplotlib inline
The data is available via the Harvard Dataverse, at http://thedata.harvard.edu/dvn/dv/icews. The two datasets I use are the ICEWS Coded Event Data and the Ground Truth Data Set. The easiest way to download both is to go to the Data & Analysis tab, click Select all files at the top, and then Download Selected Files.
The ICEWS event data comes as one file per year, initially zipped. On OSX or Linux, you can unzip all the files in a directory at once from the terminal with
$ unzip "*.zip"
And you can delete all the zipped files with
$ rm *.zip
In this document, I assume that all the annual data files, as well as the one Ground Truth data file, are in the same directory.
# Path to directory where the data is stored
DATA = "/Users/dmasad/Data/ICEWS/"
For testing purposes, I start by loading a single year into a pandas DataFrame. The data files are tab-delimited, and have the column names as the first row.
one_year = pd.read_csv(DATA + "events.1995.20150313082510.tab", sep="\t")
one_year.head()
  | Event ID | Event Date | Source Name | Source Sectors | Source Country | Event Text | CAMEO Code | Intensity | Target Name | Target Sectors | Target Country | Story ID | Sentence Number | Publisher | City | District | Province | Country | Latitude | Longitude
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 926685 | 1995-01-01 | Extremist (Russia) | Radicals / Extremists / Fundamentalists,Dissident | Russian Federation | Praise or endorse | 51 | 3.4 | Boris Yeltsin | Elite,Executive,Executive Office,Government | Russian Federation | 28235806 | 5 | The Toronto Star | Moscow | NaN | Moskva | Russian Federation | 55.7522 | 37.6156 |
1 | 926687 | 1995-01-01 | Government (Bosnia and Herzegovina) | Government | Bosnia and Herzegovina | Express intent to cooperate | 30 | 4.0 | Citizen (Serbia) | General Population / Civilian / Social,Social | Serbia | 28235807 | 1 | The Toronto Star | NaN | NaN | Bosnia | Bosnia and Herzegovina | 44.0000 | 18.0000 |
2 | 926686 | 1995-01-01 | Citizen (Serbia) | General Population / Civilian / Social,Social | Serbia | Express intent to cooperate | 30 | 4.0 | Government (Bosnia and Herzegovina) | Government | Bosnia and Herzegovina | 28235807 | 1 | The Toronto Star | NaN | NaN | Bosnia | Bosnia and Herzegovina | 44.0000 | 18.0000 |
3 | 926688 | 1995-01-01 | Canada | NaN | Canada | Praise or endorse | 51 | 3.4 | City Mayor (Canada) | Government,Local,Municipal | Canada | 28235809 | 3 | The Toronto Star | NaN | NaN | Ontario | Canada | 49.2501 | -84.4998 |
4 | 926689 | 1995-01-01 | Lawyer/Attorney (Canada) | Legal,Social | Canada | Arrest, detain, or charge with legal action | 173 | -5.0 | Police (Canada) | Government,Police | Canada | 28235964 | 1 | The Toronto Star | Montreal | Montreal | Quebec | Canada | 45.5088 | -73.5878 |
one_year.dtypes
Event ID             int64
Event Date          object
Source Name         object
Source Sectors      object
Source Country      object
Event Text          object
CAMEO Code           int64
Intensity          float64
Target Name         object
Target Sectors      object
Target Country      object
Story ID             int64
Sentence Number      int64
Publisher           object
City                object
District            object
Province            object
Country             object
Latitude           float64
Longitude          float64
dtype: object
Looks pretty good! Notice that the Event Date column is an object (meaning a string), so when we load in all of the data we should tell pandas to parse it automatically.
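If you've already loaded a file, you can also convert the column in place with pd.to_datetime; a minimal sketch (the dtype should come back as datetime64[ns]):

one_year["Event Date"] = pd.to_datetime(one_year["Event Date"])
one_year["Event Date"].dtype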
The ICEWS data isn't too big to hold in memory all at once, so I go ahead and load the entire thing. To do it, we'll iterate over all the data files, read each into a DataFrame, and then concatenate them together.
Note that in this code, I added the parse_dates=[1] argument to the .read_csv(...) method, telling pandas to parse the second column as a date.
This code assumes that the ICEWS data files are the only .tab files in your DATA directory. If that isn't the case, adjust as needed.
all_data = []
for f in os.listdir(DATA):  # Iterate over all files
    if not f.endswith(".tab"):  # Skip non-tab files
        continue
    df = pd.read_csv(DATA + f, sep="\t", parse_dates=[1])
    all_data.append(df)
data = pd.concat(all_data)
Some of the ICEWS column names have spaces in them, which means they can't be referenced using pandas's dot (attribute) notation. To fix this, I rename the columns, replacing the spaces with underscores:
cols = {col: col.replace(" ", "_") for col in data.columns}
data.rename(columns=cols, inplace=True)
data.dtypes
Event_ID                    int64
Event_Date         datetime64[ns]
Source_Name                object
Source_Sectors             object
Source_Country             object
Event_Text                 object
CAMEO_Code                  int64
Intensity                 float64
Target_Name                object
Target_Sectors             object
Target_Country             object
Story_ID                    int64
Sentence_Number             int64
Publisher                  object
City                       object
District                   object
Province                   object
Country                    object
Latitude                  float64
Longitude                 float64
dtype: object
print(data.Event_Date.min())
print(data.Event_Date.max())
1995-01-01 00:00:00
2014-02-28 00:00:00
len(data)
13514121
Looks good! The data types are what we expect, and the dates seem to have been parsed correctly.
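As an extra sanity check, we can make sure the Event IDs are unique across the concatenated files, and see which columns have missing values:

# Number of duplicated Event IDs (0 means every ID is unique)
print(len(data) - data.Event_ID.nunique())
# Count of missing values in each column
print(data.isnull().sum())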
A good first pass at the data is to see who the most frequent actors are. The following code counts how often each actor appears as the source or target of an event:
actors_source = data.Source_Name.value_counts()
actors_target = data.Target_Name.value_counts()
actor_counts = pd.DataFrame({"SourceFreq": actors_source,
"TargetFreq": actors_target})
actor_counts.fillna(0, inplace=True)
actor_counts["Total"] = actor_counts.SourceFreq + actor_counts.TargetFreq
Now let's look at the top 50 actors. For people like me who are more used to GDELT and Phoenix, the actor list might look a little different than what we expect:
actor_counts.sort("Total", ascending=False, inplace=True)
actor_counts.head(50)
  | SourceFreq | TargetFreq | Total
---|---|---|---
United States | 330446 | 341603 | 672049 |
Russia | 195571 | 260635 | 456206 |
China | 192944 | 254747 | 447691 |
Israel | 116427 | 150810 | 267237 |
Japan | 103651 | 145657 | 249308 |
India | 96871 | 147406 | 244277 |
Iran | 84099 | 150208 | 234307 |
Citizen (India) | 65350 | 136966 | 202316 |
United Nations | 92022 | 107413 | 199435 |
Unspecified Actor | 0 | 198718 | 198718 |
European Union | 90824 | 100310 | 191134 |
Iraq | 48025 | 128246 | 176271 |
Vladimir Putin | 102453 | 73104 | 175557 |
George W. Bush | 100763 | 72909 | 173672 |
North Korea | 65394 | 105958 | 171352 |
Turkey | 64995 | 104946 | 169941 |
South Korea | 69509 | 89696 | 159205 |
Pakistan | 57450 | 98419 | 155869 |
Police (India) | 117726 | 32771 | 150497 |
United Kingdom | 65195 | 81465 | 146660 |
Palestinian Territory, Occupied | 39861 | 101029 | 140890 |
France | 61803 | 72704 | 134507 |
Australia | 41393 | 71031 | 112424 |
Afghanistan | 31678 | 75604 | 107282 |
North Atlantic Treaty Organization | 51437 | 52715 | 104152 |
Syria | 33110 | 64113 | 97223 |
Germany | 42116 | 51114 | 93230 |
Egypt | 34961 | 47969 | 82930 |
Barack Obama | 46660 | 34782 | 81442 |
Indonesia | 28997 | 43795 | 72792 |
Georgia | 26227 | 45432 | 71659 |
Hu Jintao | 39679 | 28938 | 68617 |
Citizen (Palestinian Territory, Occupied) | 19921 | 48444 | 68365 |
Yasir Arafat | 31196 | 35228 | 66424 |
Citizen (Russia) | 23473 | 42401 | 65874 |
Thailand | 25541 | 40200 | 65741 |
Mahmoud Abbas | 34629 | 29995 | 64624 |
Citizen (Australia) | 22541 | 41030 | 63571 |
Serbia | 21006 | 41502 | 62508 |
Taliban | 31678 | 30688 | 62366 |
Government (India) | 12117 | 49629 | 61746 |
Israeli Defense Forces | 42364 | 19130 | 61494 |
UN Security Council | 27184 | 34210 | 61394 |
Ukraine | 23290 | 37906 | 61196 |
Taiwan | 19421 | 41052 | 60473 |
Kofi Annan | 37150 | 22442 | 59592 |
Vietnam | 24456 | 34951 | 59407 |
Tony Blair | 33083 | 25713 | 58796 |
Lebanon | 17041 | 40094 | 57135 |
Mexico | 24690 | 32142 | 56832 |
What stood out to me was the mix of country-level actors and named individuals. Unlike in event datasets that use CAMEO actor coding, leaders and sub-state organizations don't seem to be coded as add-ons to a state actor code (e.g. USAGOV), but as separate actors in their own right.
Update (03/29/2015): The _Sectors column contains the role information that would otherwise be contained in the chained CAMEO designations. For example, if you scroll back to the first row of 1995 data, the target name is Boris Yeltsin, and the target sectors associated with him are "Elite,Executive,Executive Office,Government".
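Knowing that, we can approximate a CAMEO-style role filter by matching on the sectors string. For example, a quick (and rough) sketch for pulling events aimed at executive-branch actors:

# Events whose target has an executive role; substring matching on the
# comma-separated sectors field is crude but fine for a first look
executives = data[data.Target_Sectors.str.contains("Executive", na=False)]
executives[["Target_Name", "Target_Sectors"]].head()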
The Citizen (Country) actor stood out to me in particular, especially since it isn't mentioned specifically in the included documentation -- so let's take a look at some of the rows that use it:
data[data.Source_Name=="Citizen (India)"].head()
  | Event_ID | Event_Date | Source_Name | Source_Sectors | Source_Country | Event_Text | CAMEO_Code | Intensity | Target_Name | Target_Sectors | Target_Country | Story_ID | Sentence_Number | Publisher | City | District | Province | Country | Latitude | Longitude
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
676 | 927826 | 1995-01-11 | Citizen (India) | General Population / Civilian / Social,Social | India | Reject proposal to meet, discuss, or negotiate | 125 | -5.0 | Narasimha Rao | Executive Office,Executive,Government | India | 28239021 | 2 | The Associated Press Political Service | NaN | NaN | State of Tamil Nadu | India | 11.0000 | 78.0000 |
783 | 927996 | 1995-01-12 | Citizen (India) | General Population / Civilian / Social,Social | India | Express intent to meet or negotiate | 36 | 4.0 | United States | NaN | United States | 28239081 | 1 | The Associated Press Political Service | Swanton | Saline County | Nebraska | United States | 40.3781 | -97.0728 |
2151 | 2547954 | 1995-01-26 | Citizen (India) | Social,General Population / Civilian / Social | India | Demonstrate or rally | 141 | -6.5 | Unspecified Actor | Unspecified | NaN | 28242253 | 2 | Reuters News | Jammu | NaN | State of Jammu and Kashmir | India | 32.7357 | 74.8691 |
5760 | 935070 | 1995-03-06 | Citizen (India) | Social,General Population / Civilian / Social | India | Kill by physical assault | 1823 | -10.0 | Congress Party | (National) Major Party,Government Major Party ... | India | 28915932 | 4 | The Associated Press Political Service | Hyderabad | Hyderabad | State of Andhra Pradesh | India | 17.3841 | 78.4564 |
5766 | 935081 | 1995-03-06 | Citizen (India) | Social,General Population / Civilian / Social | India | Use unconventional violence | 180 | -9.0 | Militant (India) | Unidentified Forces | India | 28915955 | 6 | The Associated Press Political Service | New Delhi | NaN | National Capital Territory of Delhi | India | 28.6358 | 77.2244 |
So it looks like Citizen really means civilians, or possibly civil society actors unaffiliated with any organization the ICEWS coding system recognizes.
Update (03/29/2015): I had trouble finding news events that corresponded to the events above, but Jennifer Lautenschlager pointed me to this news article that indicates that there was election violence in India in that time frame.
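To see which countries' citizens show up most often, one rough sketch is to filter the actor names that start with "Citizen" (here using only the source side):

# Most frequent Citizen (Country) actors as event sources
citizens = data.Source_Name[data.Source_Name.str.startswith("Citizen", na=False)]
citizens.value_counts().head(10)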
To get country-level actors comparable to other event datasets, it looks like we need to use the source and target country columns:
country_source = data.Source_Country.value_counts()
country_target = data.Target_Country.value_counts()
country_counts = pd.DataFrame({"SourceFreq": country_source,
"TargetFreq": country_target})
country_counts.fillna(0, inplace=True)
country_counts["Total"] = country_counts.SourceFreq + country_counts.TargetFreq
country_counts.sort("Total", ascending=False, inplace=True)
country_counts.head(10)
  | SourceFreq | TargetFreq | Total
---|---|---|---
United States | 997696 | 803460 | 1801156 |
India | 773712 | 760583 | 1534295 |
Russian Federation | 746829 | 706700 | 1453529 |
China | 541432 | 525955 | 1067387 |
Japan | 344413 | 332380 | 676793 |
Australia | 340339 | 320329 | 660668 |
Israel | 338118 | 315501 | 653619 |
United Kingdom | 331735 | 302389 | 634124 |
Occupied Palestinian Territory | 251678 | 317883 | 569561 |
Iran | 286274 | 283276 | 569550 |
This looks pretty good too! India seems more heavily represented than in other datasets I've seen, and of course Israel/Palestine maintain their usual place on the event data leaderboard.
Update (03/29/2015): Since the Sectors are also an important way of understanding the data, let's get their frequencies too. Sectors are a bit trickier, since each cell can contain multiple sectors, separated by commas. So we need to loop over each cell, split out the sectors mentioned, and count each one.
# Count source sectors
source_sectors = defaultdict(int)
source_sector_counts = data.Source_Sectors.value_counts()
for sectors, count in source_sector_counts.iteritems():
    # Each cell is a comma-separated combination of sectors; credit each
    # individual sector with the number of events sharing that combination
    for sector in sectors.split(","):
        source_sectors[sector] += count
# Count target sectors
target_sectors = defaultdict(int)
target_sector_counts = data.Target_Sectors.value_counts()
for sectors, count in target_sector_counts.iteritems():
    for sector in sectors.split(","):
        target_sectors[sector] += count
# Convert into series
source_sectors = pd.Series(source_sectors)
target_sectors = pd.Series(target_sectors)
# Combine into a dataframe, and fill missing with 0
sector_counts = pd.DataFrame({"SourceFreq": source_sectors,
"TargetFreq": target_sectors})
sector_counts.fillna(0, inplace=True)
sector_counts["Total"] = sector_counts.SourceFreq + sector_counts.TargetFreq
sector_counts.sort("Total", ascending=False, inplace=True)
sector_counts.head(10)
  | SourceFreq | TargetFreq | Total
---|---|---|---
Government | 176897 | 138684 | 315581 |
Parties | 171411 | 135383 | 306794 |
Ideological | 153750 | 121842 | 275592 |
(National) Major Party | 134134 | 106262 | 240396 |
Executive | 129382 | 103265 | 232647 |
Elite | 92926 | 80163 | 173089 |
Legislative / Parliamentary | 63654 | 45670 | 109324 |
Executive Office | 54710 | 49770 | 104480 |
Cabinet | 57273 | 42038 | 99311 |
Center Left | 48678 | 37593 | 86271 |
sector_counts.tail(10)
  | SourceFreq | TargetFreq | Total
---|---|---|---
International Exiles | 2 | 1 | 3 |
Bedouin | 2 | 0 | 2 |
Nepali-Pahari | 1 | 1 | 2 |
Western | 1 | 1 | 2 |
Navy Headquarters | 1 | 1 | 2 |
Army Education / Training | 0 | 1 | 1 |
Unspecified | 0 | 1 | 1 |
Consumer Services MNCs | 1 | 0 | 1 |
State-Owned Consumer Goods | 1 | 0 | 1 |
Saharan | 1 | 0 | 1 |
In addition to CAMEO-type actor designations (e.g. Government), it looks like some of the Sectors resemble the Issues in Phoenix, or the Themes in the GDELT GKG.
An easy way to spot significant changes in data collection over time is to look at total event volume. ICEWS events are dated to the day (there is no time-of-day information), so let's look at daily event counts.
daily_events = data.groupby("Event_Date").aggregate(len)["Event_ID"]
daily_events.plot(color='k', lw=0.2, figsize=(12,6),
title="ICEWS Daily Event Count")
There seems to be a definite ramp-up period from 1995 through 1999 or so, and some sort of fall in event volume around 2009. Notice that there are also a few individual days, especially around 2004, with very few events for some reason.
Update (03/29/2015): Jennifer Lautenschlager clarified that the jumps in the 1995-2001 period reflect publishers entering incrementally into the commercial data system that feeds into ICEWS. The post-2008 dip reflects a decline in number of stories overall, possibly driven by budget cuts due to the recession.
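One rough way to check the publisher story against the data is to count distinct publishers per year (a sketch; the .dt accessor assumes pandas 0.15 or later):

# Number of distinct publishers contributing events in each year
data["Year"] = data.Event_Date.dt.year
publishers_per_year = data.groupby("Year").Publisher.nunique()
publishers_per_year.plot(kind="bar", figsize=(12, 4),
                         title="Distinct Publishers Per Year")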
Since each event has an associated Story ID, we can count how many unique stories are processed by ICEWS every day and end up generating events.
daily_stories = data.groupby("Event_Date").aggregate(pd.Series.nunique)["Story_ID"]
daily_stories.plot(color='k', lw=0.2, figsize=(12,6),
title="ICEWS Daily Story Count")
With these two series, we can measure the daily average events generated per story:
events_per_story = daily_events / daily_stories
events_per_story.plot(color='k', lw=0.2, figsize=(12,6),
title="ICEWS Daily Events Per Story")
This confirms that, a few anomalies aside, the number of events generated per story stays relatively consistent over time. Nevertheless, it's probably important to at least try to distinguish between fewer stories caused by fewer newsworthy events, and fewer stories caused by fewer journalists writing them.
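To see the trend without the daily noise, we can smooth the series with a rolling average (pd.rolling_mean is the pandas 0.15-era API; newer versions use .rolling().mean() instead):

# 90-day rolling mean of events per story
smoothed = pd.rolling_mean(events_per_story, window=90)
smoothed.plot(color='k', figsize=(12,6),
              title="ICEWS Events Per Story, 90-Day Rolling Mean")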
Another good way to get an idea of the dataset's coverage is to put the events on a map. To do that, let's group the data by the latitude and longitude for each event, and count the number of events at each point. Then we can put those points on a world map using the basemap package.
points = data.groupby(["Latitude", "Longitude"]).aggregate(len)["Event_ID"]
points = points.reset_index()
Nobody will be surprised that the distribution of events-per-point is very long-tailed, with many points having only a small number of events, and a small number of points having hundreds of thousands of events.
points.Event_ID.hist()
plt.yscale('log')
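To put rough numbers on that skew, describe() gives the quartiles and the maximum:

# Summary statistics for events per unique coordinate pair
points.Event_ID.describe()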
So a reasonable way to deal with this is to scale each point's size by the log of the number of events recorded there.
The following code draws a world map using Basemap's built-in map data, and then iterates over all the points, putting a dot on the map for each one. Finally, it saves the resulting map to a PNG file.
plt.figure(figsize=(16,16))
# Draw the world map itself
m = Basemap(projection='eck4',lon_0=0,resolution='c')
m.drawcoastlines()
m.fillcontinents()
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
m.drawmapboundary()
m.drawcountries()
plt.title("ICEWS Total Events", fontsize=24)
# Plot the points
for _, row in points.iterrows():
    lat = row.Latitude
    lon = row.Longitude
    size = np.log10(row.Event_ID + 1) * 2  # Log-scaled marker size
    x, y = m(lon, lat)  # Convert lat-long to map projection coordinates
    m.plot(x, y, 'ro', markersize=size, alpha=0.3)
plt.savefig("ICEWS.png", dpi=120, facecolor="#FFFFFF")