In this assignment, we will be working with Automated License Plate Reader (ALPR) data from the Oakland, CA Police Department.
The data we will be using in this assignment has previously been used in Data 8 to teach a lesson about privacy. In that lesson, students were introduced to Oakland ALPR data and used it to track the whereabouts of a former Mayor of Oakland. Students thus considered some of the privacy concerns that ALPR data raises, including giving those with access to the data the ability to determine where people live, where they work, and what they do in their free time. One of the conclusions of the privacy lecture was that data collected for one purpose (such as to “fight crime”) can reveal a lot more than initially intended. We carry that lesson over to this assignment, in which we aim to introduce you to the social and historical contexts of Automated License Plate Reader data and to how the collection and use of this data distributes risk unevenly among different population groups.
Automated License Plate Readers, which are usually mounted on police cars, capture digital license plate data and images for law enforcement purposes (Policy Manual). In particular, the Oakland Police Department writes in their ALPR policy manual that the data can be used for “identifying stolen or wanted vehicles, stolen license plates and missing persons. It may also be used to gather information related to active warrants, suspect interdiction and stolen property recovery” (Policy Manual). As of 2015, Oakland PD operated 33 license plate readers, each of which can scan up to 60 license plates per second (Ars Technica). Thus, Oakland PD stores millions of records, of which only a small portion are associated with a criminal investigation (ACLU). While Oakland PD says that its data collection and storage procedures adhere to privacy rights, including purging unused data after 6 months and restricting the sharing of data, it is crucial to note that millions of Oakland PD license plate records are available online – this is how we found the data for this assignment (Policy Manual). In addition, while the Supreme Court has ruled that cars on public roads do not have a reasonable expectation of privacy, it is important to remember that “reasonable suspicion or probable cause is not required before using an ALPR to scan license plates or collect data” (Policy Manual).
Q0a: What is one reason that Oakland PD gives for collecting ALPR data?
Answer here
Q0b: What is one drawback to collecting and using ALPR data?
Answer here
import warnings
warnings.filterwarnings('ignore')
from datascience import *
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import folium
import mapclassify
import rtree
from folium.plugins import HeatMap
from sklearn.preprocessing import normalize
%matplotlib inline
plt.style.use('fivethirtyeight')
All available ALPR data can be found here. We are working with data from 4/1/14 - 5/31/14, which was the most recent data available when it was downloaded for Data 8. While the data is no longer the most recent, you can see how the OPD policy of deleting data after 6 months can be circumvented by private parties accessing and storing the data separately from the city's databases.
alpr = Table().read_table("data/lprs.csv")
alpr
red_VRM | red_Timestamp | Location 1 |
---|---|---|
6EZZ778 | 05/31/2014 12:41:00 PM | (37.838856, -122.221971) |
4VLN123 | 05/31/2014 12:41:00 PM | (37.838911, -122.222023) |
6EJR528 | 05/31/2014 12:41:00 PM | (37.838911, -122.222023) |
5BEJ534 | 05/31/2014 12:41:00 PM | (37.83896, -122.222071) |
6WQN812 | 05/31/2014 12:41:00 PM | (37.839103, -122.22221) |
5RJP156 | 05/31/2014 12:41:00 PM | (37.839228, -122.222323) |
5RZT811 | 05/31/2014 12:41:00 PM | (37.839295, -122.222388) |
7CHA147 | 05/31/2014 12:41:00 PM | (37.839596, -122.222893) |
7EEW593 | 05/31/2014 12:41:00 PM | (37.839593, -122.222883) |
6PBE505 | 05/31/2014 12:40:00 PM | (37.84137, -122.224333) |
... (328572 rows omitted)
Q1: What can we learn from the table alpr? What do the columns represent, and what do the rows represent?
Answer here
Q2: How many license plate readings are in our data from 4/1/14 - 5/31/14?
Answer here
The ALPR data contains some useful attributes for visualization and analysis.
As a first step in cleaning alpr, we would like to relabel the columns to make them easier to work with and to sort the data by Timestamp in chronological order. We also need to split the location into separate latitude and longitude columns so that we can leverage spatial libraries.
Consider the following column names in our final result: 'Plate', 'Timestamp', 'Latitude', and 'Longitude'.
def getlatitude(s):
    before, after = s.split(',')  # Break the string into two parts
    latstring = before[1:]        # Drop the leading '('
    return float(latstring)       # Convert the string to a number

def getlongitude(s):
    before, after = s.split(',')  # Break the string into two parts
    longstring = after[1:-1]      # Drop the leading space and the trailing ')'
    return float(longstring)      # Convert the string to a number
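A quick sanity check, using one of the raw location strings from the table above, shows what these helpers return:
# Parse a sample '(lat, lon)' string from the Location 1 column
getlatitude("(37.838856, -122.221971)")   # 37.838856
getlongitude("(37.838856, -122.221971)")  # -122.221971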
# Relabel columns
alpr = alpr.relabeled("red_VRM", "Plate").relabeled("red_Timestamp", "Timestamp")
# Split Location to 'Latitude' and 'Longitude' columns
alpr = alpr.with_columns("Latitude", alpr.apply(getlatitude, "Location 1"), "Longitude", alpr.apply(getlongitude, "Location 1")).drop("Location 1")
# Sort the LPRS data by Timestamp in chronological order
alpr = alpr.sort("Timestamp", descending=False)
alpr
Plate | Timestamp | Latitude | Longitude |
---|---|---|---|
6LWL396 | 04/01/2014 01:00:00 PM | 37.8048 | -122.251 |
4DGR470 | 04/01/2014 01:00:00 PM | 37.7938 | -122.253 |
6RIP575 | 04/01/2014 01:00:00 PM | 37.8047 | -122.251 |
4ZDX994 | 04/01/2014 01:00:00 PM | 37.8046 | -122.251 |
6XJW220 | 04/01/2014 01:00:00 PM | 37.8046 | -122.251 |
6J44213 | 04/01/2014 01:00:00 PM | 37.8042 | -122.252 |
6B03075 | 04/01/2014 01:00:00 PM | 37.8042 | -122.252 |
4PKP608 | 04/01/2014 01:00:00 PM | 37.794 | -122.255 |
4RPB940 | 04/01/2014 01:00:00 PM | 37.7939 | -122.255 |
5MBD011 | 04/01/2014 01:00:00 PM | 37.7921 | -122.248 |
... (328572 rows omitted)
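One caveat: the Timestamp column holds MM/DD/YYYY hh:mm:ss AM/PM strings, and sorting those as text is not truly chronological (for example, '01:00:00 PM' sorts before '09:00:00 AM' on the same day). A minimal sketch of a more robust approach, parsing the strings with pandas before sorting; alpr_chrono is our name for the result:
# Parse the timestamp strings so the sort is truly chronological
parsed = pd.to_datetime(alpr.column("Timestamp"), format="%m/%d/%Y %I:%M:%S %p")
alpr_chrono = alpr.with_column("Timestamp", parsed).sort("Timestamp")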
One of the first tasks when analyzing data is to explore its distribution. The distribution of values and locations will significantly impact the kind of analysis you can perform on the data and the kinds of errors you may encounter.
plate_counts = Table().read_table("data/plate_counts.csv")
plate_counts
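For reference, a table like plate_counts can be reconstructed from alpr itself. A minimal sketch, assuming the CSV simply counts how many times each plate appears:
# Count sightings per unique plate; group() adds a 'count' column
plate_counts_check = alpr.group("Plate").relabeled("count", "times seen")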
plate_counts.hist('times seen', bins=np.arange(0, 10, 1))
plate_counts_above_10 = plate_counts.where("times seen", are.above_or_equal_to(10)).num_rows
plate_counts_above_10
2117
Q3: How do the plate counts for each unique plate compare to the number of registered vehicles in Oakland? What is the gap between these two pieces of information, and what are the implications of that gap?
# Load the alpr table with each plate's 'times seen' count already appended
alpr = Table().read_table("data/alpr.csv")
alpr.show(10)
Plate | Timestamp | Latitude | Longitude | times seen |
---|---|---|---|---|
0008ZH1 | 04/02/2014 11:46:00 PM | 37.8041 | -122.299 | 3 |
0008ZH1 | 05/11/2014 09:45:00 PM | 37.8046 | -122.299 | 3 |
0008ZH1 | 05/20/2014 09:48:00 PM | 37.8041 | -122.299 | 3 |
00096B1 | 04/14/2014 09:02:00 AM | 37.7905 | -122.218 | 1 |
000CA | 04/10/2014 11:24:00 PM | 37.7895 | -122.245 | 5 |
000CA | 04/28/2014 11:04:00 AM | 37.7898 | -122.244 | 5 |
000CA | 04/28/2014 11:08:00 AM | 37.7896 | -122.245 | 5 |
000CA | 05/06/2014 08:34:00 AM | 37.7872 | -122.239 | 5 |
000CA | 05/28/2014 04:59:00 PM | 37.8116 | -122.266 | 5 |
000EK | 04/06/2014 01:41:00 PM | 37.8035 | -122.234 | 1 |
... (328572 rows omitted)
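For reference, a column like 'times seen' could be appended to the cleaned table with a join. A minimal sketch, starting from the cleaned table without the count column (not necessarily how the file above was produced):
# Attach each plate's total sighting count to every one of its readings
counts = alpr.group("Plate").relabeled("count", "times seen")
alpr_with_counts = alpr.join("Plate", counts)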
Most visualizations in data science are about seeing values in a statistically meaningful way. When your data has the added dimension of space, you can display this complexity using maps. As a visualization, mapped data brings the meaning of place into the conversation. Since people usually have an attachment to certain places and an understanding of the world that includes a spatial component, maps can be a great way to make data more interesting and meaningful. Many phenomena are spatial by nature, and giving them spatial context can aid understanding.
There are non-spatial libraries you can use to plot latitude and longitude as x and y on an axis, but for the most part you will want to use a spatial library for ease of use and added functionality. We will use the folium package because it has a lot of neat functions that make exploration easy.
Let's use some maps to visualize the data. First, let's take a look at the first 1000 readings, starting on 4/01/2014.
The following are functions we use to display maps. You are not required to learn how to use them, but the docstrings explain how they work.
It can be helpful to look at your data the way the computer sees it before modeling a social phenomenon with a geographic component.
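The original helper functions are not reproduced here; below is a minimal sketch of what such a display helper might look like (the name map_points and its parameters are our assumptions, not the assignment's actual code):
def map_points(gdf, center, zoom=12):
    """Return a folium map with one clustered marker per row of gdf.

    gdf:    a GeoDataFrame with 'Latitude' and 'Longitude' columns
    center: a [lat, lon] pair to center the map on
    zoom:   the initial zoom level
    """
    m = folium.Map(location=center, zoom_start=zoom)
    m.add_child(folium.plugins.FastMarkerCluster(gdf[['Latitude', 'Longitude']].values.tolist()))
    return m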
Q4: The Open Data Portal the City of Oakland uses has a visualization tool. Visit the page. Did you find this tool helpful? What additional functionality would you want from this tool?
Before we can bring our data into a GIS (Geographic Information System) in Python, we need to make sure it is a spatial object that the library can recognize. We can use type() to see what kind of object our data currently is.
type(alpr)
datascience.tables.Table
Our data has the spatial markers, latitude and longitude, that we can use to map it. However, it is not yet a spatial object. Luckily, it is easy to turn a datascience table into a pandas dataframe that GeoPandas can recognize. Notice the part of the code that specifies EPSG. This is a guess: usually the metadata tells you what coordinate reference system (CRS) was used to collect the data, and this is a critical piece of metadata since there are many CRS types. The data portal fails to list the CRS used, but most public latitude/longitude data is in WGS 84, also known as EPSG:4326, the system used by GPS and assumed by most web-mapping tools. Folium also requires spatial objects to have a specified CRS.
#this converts our datascience table into a pandas dataframe that we can then turn to a geopandas geodataframe
table_df = alpr.to_df()
# storing this dataframe in a csv file
#table_df.to_csv('/content/alpr.csv', index = None)
#this converts our pandas dataframe into a Geopandas object
alprgs = gpd.GeoDataFrame(
table_df, geometry=gpd.points_from_xy(table_df.Longitude, table_df.Latitude, crs = 'EPSG:4326'))
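A quick sanity check, using standard GeoPandas attributes, confirms that the conversion produced a spatial object and that the CRS we assumed is attached:
# Confirm we now have a spatial object with the expected CRS
print(type(alprgs))  # <class 'geopandas.geodataframe.GeoDataFrame'>
print(alprgs.crs)    # EPSG:4326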
#Define coordinates of where we want to center our map
bay_coords = [37.871545, -122.260807]
#Create the map
my_map = folium.Map(
location = bay_coords,
tiles='Stamen Toner',
zoom_start = 5)
# Build a GeoJson layer from the points (created here but never added to the map below)
points = folium.features.GeoJson(alprgs)
my_map.add_child(folium.plugins.FastMarkerCluster(alprgs[['Latitude', 'Longitude']].values.tolist()))
# Display the map (the explicit display() call may not be needed in a notebook)
display(my_map)
#to save the map, perhaps for embedding in a website or presentation, you could use this code
#my_map.save("save_file.html")
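The HeatMap plugin imported at the top of the notebook offers another way to view the readings, emphasizing density rather than individual points. A minimal sketch; the sample size and zoom level are arbitrary choices, and sampling simply keeps rendering fast:
# Build a density heatmap from a random sample of the readings
heat_map = folium.Map(location=bay_coords, zoom_start=12)
sample = alprgs[['Latitude', 'Longitude']].sample(5000, random_state=0)
HeatMap(sample.values.tolist()).add_to(heat_map)
heat_map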