This notebook has been updated from an earlier example which has rotted over time. IT IS OF HISTORICAL INTEREST ONLY AND THE CODE MAY NOT RUN, NOT LEAST BECAUSE THE DATA FILES IT RELIES ON HAVE DISAPPEARED. It includes notes showing where the original notebook had broken, and fixes made to it. In many cases, looking to web resources for assistance will turn up things that are almost what you want, or not quite, or that worked once and have since rotted. An effective practitioner will then take what they can from the resource, or, if only minimal fixes are required, make them themselves and perhaps contribute them back. That's what open practice is all about.
This example data investigation represents a preliminary exploration of road traffic accident data for Milton Keynes. Feel free to use it as a basis for a quick exploration of the data for your local area.
Note that the conversation descends into Unix commandline arcana in several places to parse some of the downloaded text files. We will not be formally teaching you how to use Unix commandline commands in this module. However, in many cases, you should be able to copy the commands and just change the bits that are obviously parameter values to use values of your own. Alternatively, you may choose to use a different technique to achieve a similar effect. As with many data wrangling exercises, there is typically more than one way to do it...
First locate some data... A quick search identifies Stats19 as a source of RTA incident data in the UK. We can find data files on data.gov.uk:
#Road safety data
#http://data.gov.uk/dataset/road-accidents-safety-data
#We can use the Linux commandline command wget to retrieve a datafile from a remote URL
!wget http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip
--2021-11-03 13:40:29-- http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip Resolving data.dft.gov.uk.s3.amazonaws.com (data.dft.gov.uk.s3.amazonaws.com)... 52.218.97.145 Connecting to data.dft.gov.uk.s3.amazonaws.com (data.dft.gov.uk.s3.amazonaws.com)|52.218.97.145|:80... connected. HTTP request sent, awaiting response... 403 Forbidden 2021-11-03 13:40:29 ERROR 403: Forbidden.
Okay, so that URL has rotted. Running a web search for something like "Stats19 data" might turn up a current source, such as https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data (how hard was that?).
Alternatively, we could check the Internet Archive with the zip URL, which turns up an archived copy:
# Use the web archive URL
!wget https://web.archive.org/web/20140907153326/http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip
--2021-11-03 13:53:42-- https://web.archive.org/web/20140907153326/http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip Resolving web.archive.org (web.archive.org)... 207.241.237.3 Connecting to web.archive.org (web.archive.org)|207.241.237.3|:443... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/x-zip-compressed] Saving to: ‘Stats19-Data2005-2012.zip’ Stats19-Data2005-20 [ <=> ] 72.70M 1.31MB/s in 47s 2021-11-03 13:54:31 (1.53 MB/s) - ‘Stats19-Data2005-2012.zip’ saved [76235891]
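As an aside: rather than hand-crafting archive URLs, we can ask the Internet Archive for its closest snapshot of a dead link programmatically. Here's a minimal sketch using the Wayback Machine availability API (https://archive.org/wayback/available) - this assumes the requests package is installed and that the API endpoint is still live:
#A sketch of looking up an archived snapshot via the Wayback Machine availability API
#(assumes the requests package is available and the API is still live)
import requests

def find_snapshot(url):
    #Return the URL of the closest archived snapshot of `url`, or None if there isn't one
    resp = requests.get("https://archive.org/wayback/available", params={"url": url})
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

find_snapshot("http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip")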
Let's unzip the data and then preview it:
#Create a directory for this investigation
!mkdir -p data/accidents
#The Linux commandline command unzip will unzip the downloaded file.
#The -d flag sets the target directory into which the unzipped files should be placed.
!unzip -d data/accidents Stats19-Data2005-2012.zip
Archive: Stats19-Data2005-2012.zip inflating: data/accidents/Accidents0512.csv inflating: data/accidents/Vehicles0512.csv inflating: data/accidents/Casualty0512.csv inflating: data/accidents/Road-Accident-Safety-Data-Guide-1979-2004.xls
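(If the unzip command isn't available to you, Python's standard-library zipfile module achieves the same effect - another case of there being more than one way to do it:)
#Unzip using only the Python standard library, as an alternative to the unzip command
from pathlib import Path
from zipfile import ZipFile

Path("data/accidents").mkdir(parents=True, exist_ok=True)
with ZipFile("Stats19-Data2005-2012.zip") as zf:
    zf.extractall("data/accidents")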
#%pip install pandas
import pandas as pd
#Preview the first few lines of the data file to get a feel for what's in there
#We could use the commandline command
#!head -n 3 data/accidents/Accidents0512.csv
#Or we can use pandas
pd.read_csv('data/accidents/Accidents0512.csv',nrows=3)
 | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Human_Control | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200501BS00001 | 525680 | 178240 | -0.191170 | 51.489096 | 1 | 2 | 1 | 1 | 04/01/2005 | ... | 0 | 1 | 1 | 2 | 2 | 0 | 0 | 1 | 1 | E01002849 |
1 | 200501BS00002 | 524170 | 181650 | -0.211708 | 51.520075 | 1 | 3 | 1 | 1 | 05/01/2005 | ... | 0 | 5 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01002909 |
2 | 200501BS00003 | 524520 | 182240 | -0.206458 | 51.525301 | 1 | 3 | 2 | 1 | 06/01/2005 | ... | 0 | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01002857 |
3 rows × 32 columns
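The preview elides most of the 32 columns, so it can be worth listing the column names in full - for example, by reading just the header row:
#List all the column names (the preview above elides most of them)
cols = pd.read_csv('data/accidents/Accidents0512.csv', nrows=0).columns
print(len(cols), list(cols))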
The accident location is given as a lat/long co-ordinate, and several administrative region codes are given as well: in particular, the corresponding LSOA - Lower Layer Super Output Area - in which the accident occurred. This is a low-level administrative geography region. (If you work with UK administrative datasets, you will find it useful to familiarise yourself with the administrative geography of the UK. A good place to start is the ONS guide to UK administrative geography.)
We can try to find the LSOAs in the Milton Keynes administrative area by looking up the codes with a search over an LSOA catalogue file using the search term Milton Keynes. So where can we find administrative area data? How about the ONS - the Office for National Statistics?
#LSOA areas
#https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
!wget https://geoportal.statistics.gov.uk/Docs/Names%20and%20Codes/Lower_layer_super_output_areas_\(E+W\)_2001_Names_and_Codes.zip
#https://geoportal.statistics.gov.uk/Docs/Names%20and%20Codes/Lower_layer_super_output_areas_(E+W)_2011_Names_and_Codes.zip
--2021-11-03 13:55:23-- https://geoportal.statistics.gov.uk/Docs/Names%20and%20Codes/Lower_layer_super_output_areas_(E+W)_2001_Names_and_Codes.zip Resolving geoportal.statistics.gov.uk (geoportal.statistics.gov.uk)... 34.196.253.122, 3.212.45.199, 54.173.203.10 Connecting to geoportal.statistics.gov.uk (geoportal.statistics.gov.uk)|34.196.253.122|:443... connected. HTTP request sent, awaiting response... 404 Not Found 2021-11-03 13:55:23 ERROR 404: Not Found.
Okay, so that URL has rotted too, and it seems like the web archive can't help us this time. But if we web search with the filename (Lower_layer_super_output_areas_(E+W)_2001_Names_and_Codes.zip) we can quickly find an old data.gov.uk page that appears to have some similar data, albeit for 2011: https://data.gov.uk/dataset/c4644a20-16e0-447b-be05-7a23e8d1517d/lower-layer-super-output-areas-december-2011-names-and-codes-in-england-and-wales
!wget http://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv
--2021-11-03 14:08:38-- http://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv Resolving geoportal1-ons.opendata.arcgis.com (geoportal1-ons.opendata.arcgis.com)... 34.197.204.215, 3.219.212.222, 54.209.123.207 Connecting to geoportal1-ons.opendata.arcgis.com (geoportal1-ons.opendata.arcgis.com)|34.197.204.215|:80... connected. HTTP request sent, awaiting response... 302 Moved Temporarily Location: https://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv [following] --2021-11-03 14:08:38-- https://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv Connecting to geoportal1-ons.opendata.arcgis.com (geoportal1-ons.opendata.arcgis.com)|34.197.204.215|:443... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/csv] Saving to: ‘3ce71e53d9254a73b3e887a506b82f63_0.csv’ 3ce71e53d9254a73b3e [ <=> ] 1.07M --.-KB/s in 0.1s 2021-11-03 14:08:39 (7.50 MB/s) - ‘3ce71e53d9254a73b3e887a506b82f63_0.csv’ saved [1118065]
#Unzip the downloaded data file
# NO LONGER REQUIRED
#!unzip -d data/accidents Lower_layer_super_output_areas_\(E+W\)_2001_Names_and_Codes.zip
Archive: Lower_layer_super_output_areas_(E+W)_2001_Names_and_Codes.zip inflating: data/accidents/LSOA_2001_EW_NC.csv inflating: data/accidents/Product Specification for LSOA_2001_EW_NC.docx inflating: data/accidents/LSOA_2001_EW_NC_metadata.xml
# COPY NEW FILE ONTO OLD FILENAME
!cp 3ce71e53d9254a73b3e887a506b82f63_0.csv data/accidents/LSOA_2011_EW_NC.csv
!ls data/accidents
Accidents0512.csv lsoa_mk.txt Casualty0512.csv Road-Accident-Safety-Data-Guide-1979-2004.xls LSOA_2011_EW_NC.csv Vehicles0512.csv
Let's have a quick peek at the data...
LSOA_PATH = 'data/accidents/LSOA_2011_EW_NC.csv' # was 'data/accidents/LSOA_2001_EW_NC.csv'
pd.read_csv(LSOA_PATH, nrows=3)
 | LSOA11CD | LSOA11NM | FID |
---|---|---|---|
0 | E01000054 | Barking and Dagenham 018A | 1 |
1 | E01000451 | Bexley 009F | 2 |
2 | E01000175 | Barnet 006E | 3 |
LOOKS LIKE WE HAVE 11 IN THE COLNAMES NOT 01.
Let's do a test case on a search for Milton Keynes.
Rather than load all the data into pandas dataframes and work with it in a notebook, I'm going to do the preliminary work using some of the commandline tools provided as part of the csvkit suite.
(Feel free to try to do a similar set of operations yourself using pandas.)
#%pip install csvkit
csvgrep is named after the grep *nix file search commandline tool. The -c argument identifies the column we want to search in, and the -m argument determines the search phrase. We then pipe - | - the result into the csvcut tool, which extracts the contents of the column or columns specified by the -c flag. The final two parts of the command remove the first, header line from the result (tail -n +2 - show the tail of the file starting at the second line) and then display just the first five lines of what remains (head -n 5).
# REFACTOR THINGS RATHER THAN HAVING LITERALS REPEATED EVERYWHERE
NAME_CODE = "LSOA11NM" # was LSOA01NM
CODE_CODE = "LSOA11CD" # was LSOA01CD
PLACE = "Milton Keynes"
!csvgrep -c $NAME_CODE -m "$PLACE" $LSOA_PATH | csvcut -c $CODE_CODE | tail -n +2 | head -n 5
E01033497 E01033506 E01033498 E01033499 E01033508
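(If you'd rather stay in pandas - and you're invited to try - a sketch of an equivalent filter, reusing the variables defined above, might look like this:)
#A pandas sketch equivalent to the csvgrep | csvcut | tail | head pipeline above
lsoa = pd.read_csv(LSOA_PATH)
lsoa.loc[lsoa[NAME_CODE].str.contains(PLACE), CODE_CODE].head(5).to_list()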
This is where we need to take a deep breath and think clearly. What do we actually want to do with the accident dataset? We have some accident data in one file that includes a column identifying the LSOA. In another file we have a list of LSOA codes, each listed against a common name for a district or area. If we can get a list of LSOA codes associated with Milton Keynes, then we can search the accidents file against each of these codes to get a list of accidents occurring within LSOAs associated with Milton Keynes.
csvgrep will allow us to search for rows containing Milton Keynes in the LSOA name column and extract the corresponding LSOA codes. We can then use these codes as the basis for a search for accidents occurring in those areas in the accidents dataset.
Let's grab the LSOA codes for areas that mention Milton Keynes in their name and pipe the output of the full filter search - without a header - into a new file.
!csvgrep -c $NAME_CODE -m "$PLACE" $LSOA_PATH | csvcut -c $CODE_CODE | tail -n +2 > data/accidents/lsoa_mk.txt
# Original had incorrect setting of header=0
pd.read_csv('data/accidents/lsoa_mk.txt',nrows=3, header=None)
 | 0 |
---|---|
0 | E01033497 |
1 | E01033506 |
2 | E01033498 |
We can now use the csvgrep
command to search the contents of the accidents file for rows that contain the LSOA code using LSOA codes for Milton Keynes, which is to say, the codes we grabbed into a separate file.
#The csvgrep command's -f flag can be used to take search arguments from the first file
# and apply them to the file that we want to search through
#The column headings are preserved in the output file (which is a different result to that of the vanilla grep operation)
!csvgrep -c LSOA_of_Accident_Location -f data/accidents/lsoa_mk.txt data/accidents/Accidents0512.csv > data/accidents/mkAccidents.csv
Let's do a quick count to see how many there were:
!wc -l data/accidents/mkAccidents.csv
4660 data/accidents/mkAccidents.csv
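Note that wc -l also counts the header row, so the number of accident records is one fewer. As a sanity check, a pandas sketch of the same filter should agree:
#Sanity check the record count with pandas (should be the wc -l count minus the header line)
mk_codes = pd.read_csv('data/accidents/lsoa_mk.txt', header=None)[0]
accidents = pd.read_csv('data/accidents/Accidents0512.csv')
accidents['LSOA_of_Accident_Location'].isin(mk_codes).sum()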
The next step is to quickly pop these markers onto a map to have a look at them. I'm going to use the folium library.
#%pip install folium
import folium
#DEPRECATED
#folium.initialize_notebook()
'''
# THIS WAS FROM AGES AGO...
# ANYONE WHO USES folium NOW WOULD LOOK AT THIS AND THINK: WTF?
from IPython.display import HTML
def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.

    This method will not work if the map depends on any files (json data). Also this uses
    the HTML5 srcdoc attribute, which may not be supported in all browsers.
    """
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'.format(srcdoc=map.HTML.replace('"', '&quot;')))

def embed_map(map, path="map.html"):
    """
    Embeds a linked iframe to the map into the IPython notebook.

    Note: this method will not capture the source of the map into the notebook.
    This method should work for all maps (as long as they use relative urls).
    """
    map.create_map(path=path)
    return HTML('<iframe src="files/{path}" style="width: 100%; height: 510px; border: none"></iframe>'.format(path=path))
''';
mkAccidents=pd.read_csv('data/accidents/mkAccidents.csv')
mkAccidents[:5]
 | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Human_Control | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200543N003115 | 484300 | 245540 | -0.770665 | 52.101742 | 43 | 3 | 2 | 5 | 01/11/2005 | ... | 0 | 0 | 4 | 1 | 2 | 0 | 0 | 2 | 1 | E01016802 |
1 | 200543N005105 | 486590 | 236700 | -0.739479 | 52.021930 | 43 | 3 | 2 | 2 | 01/10/2005 | ... | 0 | 0 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | E01016848 |
2 | 200543N006095 | 483380 | 238900 | -0.785714 | 52.042196 | 43 | 3 | 2 | 2 | 06/09/2005 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | E01016813 |
3 | 200543N006115 | 480090 | 240060 | -0.833401 | 52.053108 | 43 | 2 | 2 | 2 | 01/11/2005 | ... | 0 | 0 | 6 | 1 | 2 | 0 | 0 | 2 | 1 | E01016811 |
4 | 200543N011085 | 481730 | 240140 | -0.809469 | 52.053588 | 43 | 3 | 1 | 1 | 02/08/2005 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | E01016838 |
5 rows × 32 columns
Let's create a base map.
mk_accident_map = folium.Map(location=[52,-0.75], zoom_start=11)
Now let's see if we can add the accident markers to it. We have the accident data loaded into a pandas dataframe, so we can use apply() to call a function for each row that will add a marker to the map.
# A HACKED VERSION OF THE PREVIOUS BROKEN VERSION
def addMarker(accident):
    #mk_accident_map.simple_marker() was the old, now removed, API;
    #create a Marker and explicitly add it to the map instead
    folium.Marker([accident['Latitude'], accident['Longitude']],
                  popup="Accident severity: " + str(accident['Accident_Severity'])
                  ).add_to(mk_accident_map)
Recalling that there were several thousand accidents, let's limit ourselves to just putting a handful of markers on the map for now to try things out!
#Generate a map preview using data from just 20 accidents
mkAccidents[:20].apply(addMarker,axis=1)
#inline_map(mk_accident_map)
mk_accident_map
Leaflet maps have a plugin that can automatically cluster large numbers of markers, rather than displaying them all at once, and will then unpack the clusters to individual markers as you zoom in on a region.
This means that if there is a large number of markers in a similar area, we can cluster them at low zoom levels and only reveal the individual markers as we zoom into the area to view more local detail.
#NEW
from folium.plugins import MarkerCluster

def _addClusterMarker(cluster, accident):
    #fmap.simple_marker() was the old API
    folium.Marker([accident['Latitude'], accident['Longitude']],
                  popup="Accident severity: " + str(accident['Accident_Severity']),
                  #clustered_marker=True is no longer used - we add markers to a MarkerCluster instead
                  ).add_to(cluster)

mk_accident_clustermap = folium.Map(location=[52, -0.75], zoom_start=11)
mCluster = MarkerCluster().add_to(mk_accident_clustermap)

def addClusterMarker(accident):
    _addClusterMarker(mCluster, accident)
#Let's generate this map using all the data
mkAccidents.apply(addClusterMarker, axis=1)
#inline_map(mk_accident_clustermap)
mk_accident_clustermap
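(Rendering several thousand markers inline can be sluggish in some browsers; if so, folium maps can also be saved to a standalone HTML file and opened separately:)
#Save the clustered map as a standalone HTML page
mk_accident_clustermap.save('mk_accidents.html')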
Zooming in to this map, one thing I notice is how the roundabouts appear to act as focal points for accidents.
Feel free to build on this investigation, or use it as the basis to carry out a similar preliminary investigation of data from your own area.
Upload your completed investigation notebook to OpenDesignStudio, either privately, or share it with other students so that you can compare notes about the techniques you used in your separate investigations and/or discuss with each other what you discovered.