This notebook has been updated from an earlier example which has rotted over time. IT IS OF HISTORICAL INTEREST ONLY AND THE CODE MAY NOT RUN, NOT LEAST BECAUSE THE DATA FILES IT RELIES ON HAVE DISAPPEARED. It includes notes showing where the original notebook had broken, and fixes made to it. In many cases, looking to web resources for assistance will turn up things that are almost what you want, or not quite, or that worked once and have since rotted. An effective practitioner will then take what they can from the resource, or, if only minimal fixes are required, make them themselves and perhaps contribute them back. That's what open practice is all about.
This example data investigation represents a preliminary exploration of road traffic accident data for Milton Keynes. Feel free to use it as a basis for a quick exploration of the data for your local area.
Note that the conversation descends into Unix commandline arcana in several places to parse some of the downloaded text files. We will not be formally teaching you how to use Unix commandline commands in this module. However, in many cases, you should be able to copy the commands and just change the bits that are obviously parameter values to use values of your own. Alternatively, you may choose to use a different technique to achieve a similar effect. As with many data wrangling exercises, there is typically more than one way to do it...
First locate some data... A quick search identifies Stats19 as a source of RTA incident data in the UK. We can find data files on data.gov.uk:
#Road safety data
#http://data.gov.uk/dataset/road-accidents-safety-data
#We can use the Linux commandline command wget to retrieve a datafile from a remote URL
!wget http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip
--2021-11-03 13:40:29-- http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip Resolving data.dft.gov.uk.s3.amazonaws.com (data.dft.gov.uk.s3.amazonaws.com)... 52.218.97.145 Connecting to data.dft.gov.uk.s3.amazonaws.com (data.dft.gov.uk.s3.amazonaws.com)|52.218.97.145|:80... connected. HTTP request sent, awaiting response... 403 Forbidden 2021-11-03 13:40:29 ERROR 403: Forbidden.
Okay, so that URL has rotted. Running a web search for something like "Stats19 data" might turn up a current source, such as https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data (how hard was that?).
Alternatively, we could check the Internet Archive with the zip URL, which turns up an archived copy:
# Use the web archive URL
!wget https://web.archive.org/web/20140907153326/http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip
--2021-11-03 13:53:42-- https://web.archive.org/web/20140907153326/http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip Resolving web.archive.org (web.archive.org)... 207.241.237.3 Connecting to web.archive.org (web.archive.org)|207.241.237.3|:443... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [application/x-zip-compressed] Saving to: ‘Stats19-Data2005-2012.zip’ Stats19-Data2005-20 [ <=> ] 72.70M 1.31MB/s in 47s 2021-11-03 13:54:31 (1.53 MB/s) - ‘Stats19-Data2005-2012.zip’ saved [76235891]
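As an aside: rather than hand-crafting archive URLs, we can ask the Internet Archive for its closest snapshot of a dead link programmatically. Here's a minimal sketch using the Wayback Machine availability API (https://archive.org/wayback/available) - this assumes the requests package is installed and that the API endpoint is still live:
#A sketch of looking up an archived snapshot via the Wayback Machine availability API
#(assumes the requests package is available and the API is still live)
import requests

def find_snapshot(url):
    #Return the URL of the closest archived snapshot of `url`, or None if there isn't one
    resp = requests.get("https://archive.org/wayback/available", params={"url": url})
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

find_snapshot("http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data2005-2012.zip")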
Let's unzip the data and then preview it:
#Create a directory for this investigation
!mkdir -p data/accidents
#The Linux commandline command unzip will unzip the downloaded file.
#The -d flag sets the target directory into which the unzipped files should be placed.
!unzip -d data/accidents Stats19-Data2005-2012.zip
Archive: Stats19-Data2005-2012.zip inflating: data/accidents/Accidents0512.csv inflating: data/accidents/Vehicles0512.csv inflating: data/accidents/Casualty0512.csv inflating: data/accidents/Road-Accident-Safety-Data-Guide-1979-2004.xls
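(If the unzip command isn't available to you, Python's standard-library zipfile module achieves the same effect - another case of there being more than one way to do it:)
#Unzip using only the Python standard library, as an alternative to the unzip command
from pathlib import Path
from zipfile import ZipFile

Path("data/accidents").mkdir(parents=True, exist_ok=True)
with ZipFile("Stats19-Data2005-2012.zip") as zf:
    zf.extractall("data/accidents")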
#%pip install pandas
import pandas as pd
#Preview the first few lines of the data file to get a feel for what's in there
#We could use the commandline command
#!head -n 3 data/accidents/Accidents0512.csv
#Or we can use pandas
pd.read_csv('data/accidents/Accidents0512.csv',nrows=3)
 | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Human_Control | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200501BS00001 | 525680 | 178240 | -0.191170 | 51.489096 | 1 | 2 | 1 | 1 | 04/01/2005 | ... | 0 | 1 | 1 | 2 | 2 | 0 | 0 | 1 | 1 | E01002849 |
1 | 200501BS00002 | 524170 | 181650 | -0.211708 | 51.520075 | 1 | 3 | 1 | 1 | 05/01/2005 | ... | 0 | 5 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01002909 |
2 | 200501BS00003 | 524520 | 182240 | -0.206458 | 51.525301 | 1 | 3 | 2 | 1 | 06/01/2005 | ... | 0 | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01002857 |
3 rows × 32 columns
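The preview elides most of the 32 columns, so it can be worth listing the column names in full - for example, by reading just the header row:
#List all the column names (the preview above elides most of them)
cols = pd.read_csv('data/accidents/Accidents0512.csv', nrows=0).columns
print(len(cols), list(cols))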
The accident location is given as a lat/long co-ordinate, and several administrative region codes are given as well: in particular, the corresponding LSOA - Lower Layer Super Output Area - in which the accident occurred. This is a low-level administrative geography region. (If you work with UK administrative datasets, you will find it useful to familiarise yourself with the administrative geography of the UK. A good place to start is the ONS guide to UK administrative geography.)
We can try to find the LSOAs in the Milton Keynes administrative area by looking up the codes with a search over an LSOA catalogue file using the search term Milton Keynes. So where can we find administrative area data? How about the ONS - the Office for National Statistics?
#LSOA areas
#https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
!wget https://geoportal.statistics.gov.uk/Docs/Names%20and%20Codes/Lower_layer_super_output_areas_\(E+W\)_2001_Names_and_Codes.zip
#https://geoportal.statistics.gov.uk/Docs/Names%20and%20Codes/Lower_layer_super_output_areas_(E+W)_2011_Names_and_Codes.zip
--2021-11-03 13:55:23-- https://geoportal.statistics.gov.uk/Docs/Names%20and%20Codes/Lower_layer_super_output_areas_(E+W)_2001_Names_and_Codes.zip Resolving geoportal.statistics.gov.uk (geoportal.statistics.gov.uk)... 34.196.253.122, 3.212.45.199, 54.173.203.10 Connecting to geoportal.statistics.gov.uk (geoportal.statistics.gov.uk)|34.196.253.122|:443... connected. HTTP request sent, awaiting response... 404 Not Found 2021-11-03 13:55:23 ERROR 404: Not Found.
Okay, so that URL has rotted too, and it seems like the web archive can't help us this time. But if we web search with the filename (Lower_layer_super_output_areas_(E+W)_2001_Names_and_Codes.zip) we can quickly find an old data.gov.uk page that appears to have some similar data, albeit for 2011: https://data.gov.uk/dataset/c4644a20-16e0-447b-be05-7a23e8d1517d/lower-layer-super-output-areas-december-2011-names-and-codes-in-england-and-wales
!wget http://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv
--2021-11-03 14:08:38-- http://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv Resolving geoportal1-ons.opendata.arcgis.com (geoportal1-ons.opendata.arcgis.com)... 34.197.204.215, 3.219.212.222, 54.209.123.207 Connecting to geoportal1-ons.opendata.arcgis.com (geoportal1-ons.opendata.arcgis.com)|34.197.204.215|:80... connected. HTTP request sent, awaiting response... 302 Moved Temporarily Location: https://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv [following] --2021-11-03 14:08:38-- https://geoportal1-ons.opendata.arcgis.com/datasets/3ce71e53d9254a73b3e887a506b82f63_0.csv Connecting to geoportal1-ons.opendata.arcgis.com (geoportal1-ons.opendata.arcgis.com)|34.197.204.215|:443... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/csv] Saving to: ‘3ce71e53d9254a73b3e887a506b82f63_0.csv’ 3ce71e53d9254a73b3e [ <=> ] 1.07M --.-KB/s in 0.1s 2021-11-03 14:08:39 (7.50 MB/s) - ‘3ce71e53d9254a73b3e887a506b82f63_0.csv’ saved [1118065]
#Unzip the downloaded data file
# NO LONGER REQUIRED
#!unzip -d data/accidents Lower_layer_super_output_areas_\(E+W\)_2001_Names_and_Codes.zip
Archive: Lower_layer_super_output_areas_(E+W)_2001_Names_and_Codes.zip inflating: data/accidents/LSOA_2001_EW_NC.csv inflating: data/accidents/Product Specification for LSOA_2001_EW_NC.docx inflating: data/accidents/LSOA_2001_EW_NC_metadata.xml
# COPY NEW FILE ONTO OLD FILENAME
!cp 3ce71e53d9254a73b3e887a506b82f63_0.csv data/accidents/LSOA_2011_EW_NC.csv
!ls data/accidents
Accidents0512.csv lsoa_mk.txt Casualty0512.csv Road-Accident-Safety-Data-Guide-1979-2004.xls LSOA_2011_EW_NC.csv Vehicles0512.csv
Let's have a quick peek at the data...
LSOA_PATH = 'data/accidents/LSOA_2011_EW_NC.csv' # was 'data/accidents/LSOA_2001_EW_NC.csv'
pd.read_csv(LSOA_PATH, nrows=3)
 | LSOA11CD | LSOA11NM | FID |
---|---|---|---|
0 | E01000054 | Barking and Dagenham 018A | 1 |
1 | E01000451 | Bexley 009F | 2 |
2 | E01000175 | Barnet 006E | 3 |
LOOKS LIKE WE HAVE 11 IN THE COLNAMES NOT 01.
Let's do a test case on a search for Milton Keynes.
Rather than load all the data into pandas dataframes and work with it in a notebook, I'm going to do the preliminary work using some of the commandline tools provided as part of the csvkit suite.
(Feel free to try to do a similar set of operations yourself using pandas.)
#%pip install csvkit
csvgrep is named after the grep *nix file search commandline tool. The -c argument identifies the column we want to search in, and the -m argument determines the search phrase. We then pipe - | - the result into the csvcut tool, which extracts the contents of the column or columns specified by the -c flag. The final two parts of the command remove the first, header line from the result (tail -n +2 - show the tail of the file starting at the second line) and then display just the first five lines of what remains (head -n 5).
# REFACTOR THINGS RATHER THAN HAVING LITERALS REPEATED EVERYWHERE
NAME_CODE = "LSOA11NM" # was LSOA01NM
CODE_CODE = "LSOA11CD" # was LSOA01CD
PLACE = "Milton Keynes"
!csvgrep -c $NAME_CODE -m "$PLACE" $LSOA_PATH | csvcut -c $CODE_CODE | tail -n +2 | head -n 5
E01033497 E01033506 E01033498 E01033499 E01033508
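(If you'd rather stay in pandas - and you're invited to try - a sketch of an equivalent filter, reusing the variables defined above, might look like this:)
#A pandas sketch equivalent to the csvgrep | csvcut | tail | head pipeline above
lsoa = pd.read_csv(LSOA_PATH)
lsoa.loc[lsoa[NAME_CODE].str.contains(PLACE), CODE_CODE].head(5).to_list()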
This is where we need to take a deep breath and think clearly. What do we actually want to do with the accident dataset? We have some accident data in one file that includes a column identifying the LSOA. In another file we have a list of LSOA codes, each listed against a common name for a district or area. If we can get a list of LSOA codes associated with Milton Keynes, then we can search the accidents file against each of these codes to get a list of accidents occurring within LSOAs associated with Milton Keynes.
csvgrep will allow us to search for rows containing Milton Keynes in the LSOA name column and extract the corresponding LSOA codes. We can then use these codes as the basis for a search for accidents occurring in those areas in the accidents dataset.
Let's grab the LSOA codes for areas that mention Milton Keynes in their name and pipe the output of the full filter search - without a header - into a new file.
!csvgrep -c $NAME_CODE -m "$PLACE" $LSOA_PATH | csvcut -c $CODE_CODE | tail -n +2 > data/accidents/lsoa_mk.txt
# Original had incorrect setting of header=0
pd.read_csv('data/accidents/lsoa_mk.txt',nrows=3, header=None)
 | 0 |
---|---|
0 | E01033497 |
1 | E01033506 |
2 | E01033498 |
We can now use the csvgrep
command to search the contents of the accidents file for rows that contain the LSOA code using LSOA codes for Milton Keynes, which is to say, the codes we grabbed into a separate file.
#The csvgrep command's -f flag can be used to take search arguments from the first file
# and apply them to the file that we want to search through
#The column headings are preserved in the output file (which is a different result to that of the vanilla grep operation)
!csvgrep -c LSOA_of_Accident_Location -f data/accidents/lsoa_mk.txt data/accidents/Accidents0512.csv > data/accidents/mkAccidents.csv
Let's do a quick count to see how many there were:
!wc -l data/accidents/mkAccidents.csv
4660 data/accidents/mkAccidents.csv
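Note that wc -l also counts the header row, so the number of accident records is one fewer. As a sanity check, a pandas sketch of the same filter should agree:
#Sanity check the record count with pandas (should be the wc -l count minus the header line)
mk_codes = pd.read_csv('data/accidents/lsoa_mk.txt', header=None)[0]
accidents = pd.read_csv('data/accidents/Accidents0512.csv')
accidents['LSOA_of_Accident_Location'].isin(mk_codes).sum()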
The next step is to quickly pop these markers onto a map to have a look at them. I'm going to use the folium library.
#%pip install folium
import folium
#DEPRECATED
#folium.initialize_notebook()
'''
# THIS WAS FROM AGES AGO...
# ANYONE WHO USES folium NOW WOULD LOOK AT THIS AND THINK: WTF?
from IPython.display import HTML
def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.

    This method will not work if the map depends on any files (json data). Also this uses
    the HTML5 srcdoc attribute, which may not be supported in all browsers.
    """
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'.format(srcdoc=map.HTML.replace('"', '&quot;')))

def embed_map(map, path="map.html"):
    """
    Embeds a linked iframe to the map into the IPython notebook.

    Note: this method will not capture the source of the map into the notebook.
    This method should work for all maps (as long as they use relative urls).
    """
    map.create_map(path=path)
    return HTML('<iframe src="files/{path}" style="width: 100%; height: 510px; border: none"></iframe>'.format(path=path))
''';
mkAccidents=pd.read_csv('data/accidents/mkAccidents.csv')
mkAccidents[:5]
 | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Human_Control | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200543N003115 | 484300 | 245540 | -0.770665 | 52.101742 | 43 | 3 | 2 | 5 | 01/11/2005 | ... | 0 | 0 | 4 | 1 | 2 | 0 | 0 | 2 | 1 | E01016802 |
1 | 200543N005105 | 486590 | 236700 | -0.739479 | 52.021930 | 43 | 3 | 2 | 2 | 01/10/2005 | ... | 0 | 0 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | E01016848 |
2 | 200543N006095 | 483380 | 238900 | -0.785714 | 52.042196 | 43 | 3 | 2 | 2 | 06/09/2005 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | E01016813 |
3 | 200543N006115 | 480090 | 240060 | -0.833401 | 52.053108 | 43 | 2 | 2 | 2 | 01/11/2005 | ... | 0 | 0 | 6 | 1 | 2 | 0 | 0 | 2 | 1 | E01016811 |
4 | 200543N011085 | 481730 | 240140 | -0.809469 | 52.053588 | 43 | 3 | 1 | 1 | 02/08/2005 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | E01016838 |
5 rows × 32 columns
Let's create a base map.
mk_accident_map = folium.Map(location=[52,-0.75], zoom_start=11)
Now let's see if we can add the accident markers to it. We have the accident data loaded into a pandas dataframe, so we can use apply() to call a function for each row that will add a marker to the map.
# A HACKED VERSION OF THE PREVIOUS BROKEN VERSION
def addMarker(accident):
    #mk_accident_map.simple_marker() was the old, now removed, API;
    #create a Marker and explicitly add it to the map instead
    folium.Marker([accident['Latitude'], accident['Longitude']],
                  popup="Accident severity: " + str(accident['Accident_Severity'])
                  ).add_to(mk_accident_map)
Recalling that there were several thousand accidents, let's limit ourselves to just putting a handful of markers on the map for now to try things out!
#Generate a map preview using data from just 20 accidents
mkAccidents[:20].apply(addMarker,axis=1)
#inline_map(mk_accident_map)
mk_accident_map
Leaflet maps have a plugin that can automatically cluster large numbers of markers, rather than displaying them all at once, and will then unpack the clusters to individual markers as you zoom in on a region.
This means that if there is a large number of markers in a similar area, we can cluster them at low zoom levels and only reveal the individual markers as we zoom into the area to view more local detail.
#NEW
from folium.plugins import MarkerCluster

def _addClusterMarker(cluster, accident):
    #fmap.simple_marker() was the old API
    folium.Marker([accident['Latitude'], accident['Longitude']],
                  popup="Accident severity: " + str(accident['Accident_Severity']),
                  #clustered_marker=True is no longer used - we add markers to a MarkerCluster instead
                  ).add_to(cluster)

mk_accident_clustermap = folium.Map(location=[52, -0.75], zoom_start=11)
mCluster = MarkerCluster().add_to(mk_accident_clustermap)

def addClusterMarker(accident):
    _addClusterMarker(mCluster, accident)
#Let's generate this map using all the data
mkAccidents.apply(addClusterMarker, axis=1)
#inline_map(mk_accident_clustermap)
mk_accident_clustermap
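(Rendering several thousand markers inline can be sluggish in some browsers; if so, folium maps can also be saved to a standalone HTML file and opened separately:)
#Save the clustered map as a standalone HTML page
mk_accident_clustermap.save('mk_accidents.html')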
Zooming in to this map, one thing I notice is how the roundabouts appear to act as focal points for accidents.
Feel free to build on this investigation, or use it as the basis to carry out a similar preliminary investigation of data from your own area.
Upload your completed investigation notebook to OpenDesignStudio, either privately, or share it with other students so that you can compare notes about the techniques you used in your separate investigations and/or discuss with each other what you discovered.