This notebook demonstrates a quick investigation into house price index inflation on the Isle of Wight.
The data will be pulled from the Land Registry open Linked Data datastore (http://landregistry.data.gov.uk/).
#Import the necessary packages
#!pip3 import SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON
#Add some helper functions
def runQuery(endpoint,prefix,q):
''' Run a SPARQL query with a declared prefix over a specified endpoint '''
sparql = SPARQLWrapper(endpoint)
sparql.setQuery(prefix+q)
sparql.setReturnFormat(JSON)
return sparql.query().convert()
import pandas as pd
def dict2df(results):
''' Hack a function to flatten the SPARQL query results and return the column values '''
data=[]
for result in results["results"]["bindings"]:
tmp={}
for el in result:
tmp[el]=result[el]['value']
data.append(tmp)
df = pd.DataFrame(data)
return df
def dfResults(endpoint,prefix,q):
''' Generate a data frame containing the results of running
a SPARQL query with a declared prefix over a specified endpoint '''
return dict2df( runQuery( endpoint, prefix, q ) )
def printQuery(results,limit=''):
''' Print the results from the SPARQL query '''
resdata=results["results"]["bindings"]
if limit!='': resdata=results["results"]["bindings"][:limit]
for result in resdata:
for ans in result:
print('{0}: {1}'.format(ans,result[ans]['value']))
print()
def printRunQuery(endpoint,prefix,q,limit=''):
''' Print the results from the SPARQL query '''
results=runQuery(endpoint,prefix,q)
printQuery(results,limit)
/usr/local/lib/python3.4/dist-packages/SPARQLWrapper/Wrapper.py:100: RuntimeWarning: JSON-LD disabled because no suitable support has been found warnings.warn("JSON-LD disabled because no suitable support has been found", RuntimeWarning)
The first thing we need to do is set up the endpoint.
endpoint_landreg='http://landregistry.data.gov.uk/landregistry/query'
#The core of this query is based on an example query from http://landregistry.data.gov.uk/app/hpi/qonsole
prefix_lr='''
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
PREFIX lrhpi: <http://landregistry.data.gov.uk/def/hpi/>
PREFIX lrppi: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
'''
q='''
SELECT ?index ?regionName ?yearmonth ?indexr ?region ?avgPriceAll ?avgDetached
?avgSemi ?avgFlats ?avgTerraced ?annual ?volume {
?region rdfs:label ?regionName.
?region rdfs:label "Isle of Wight"@en.
?index a <http://landregistry.data.gov.uk/def/hpi/MonthlyIndicesByRegion>.
?index <http://landregistry.data.gov.uk/def/hpi/refRegion> ?region.
?index
lrhpi:refRegion ?regionURI ;
lrhpi:indicesSASM ?indexr ;
lrhpi:refPeriod ?yearmonth ;
lrhpi:averagePricesSASM ?avgPriceAll ;
lrhpi:monthlyChange ?monthly ;
lrhpi:averagePricesDetachedSASM ?avgDetached ;
lrhpi:averagePricesSemiDetachedSASM ?avgSemi ;
lrhpi:averagePricesFlatMaisonetteSASM ?avgFlats ;
lrhpi:averagePricesTerracedSASM ?avgTerraced .
FILTER(LANG(?regionName ) = "" || LANGMATCHES(LANG(?regionName), "en"))
}
'''
df=dfResults(endpoint_landreg,prefix_lr,q)
df[:5]
avgDetached | avgFlats | avgPriceAll | avgSemi | avgTerraced | index | indexr | region | regionName | yearmonth | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 131372 | 50969 | 77978 | 79652 | 67866 | http://landregistry.data.gov.uk/data/hpi/regio... | 157.17 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-09 |
1 | 125395 | 48650 | 74430 | 76028 | 64779 | http://landregistry.data.gov.uk/data/hpi/regio... | 150.02 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-06 |
2 | 123812 | 48037 | 73491 | 75069 | 63961 | http://landregistry.data.gov.uk/data/hpi/regio... | 148.12 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-05 |
3 | 129363 | 50190 | 76786 | 78434 | 66828 | http://landregistry.data.gov.uk/data/hpi/regio... | 154.76 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-08 |
4 | 128112 | 49705 | 76043 | 77676 | 66182 | http://landregistry.data.gov.uk/data/hpi/regio... | 153.27 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-07 |
It could be useful to cast the yearmonth
column to a date. By default, the first of the month will be the assigned data, but a little bit of jiggery pokery can reassign it to the last day of the month.
df['dt']=pd.to_datetime(df.yearmonth,format='%Y-%m')
df.set_index('dt',inplace=True)
#Set the date to be the end of the month
#via http://stackoverflow.com/a/18233876/454773
df=df.to_period('M').to_timestamp('M')
df[:3]
avgDetached | avgFlats | avgPriceAll | avgSemi | avgTerraced | index | indexr | region | regionName | yearmonth | |
---|---|---|---|---|---|---|---|---|---|---|
dt | ||||||||||
2000-09-30 | 131372 | 50969 | 77978 | 79652 | 67866 | http://landregistry.data.gov.uk/data/hpi/regio... | 157.17 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-09 |
2000-06-30 | 125395 | 48650 | 74430 | 76028 | 64779 | http://landregistry.data.gov.uk/data/hpi/regio... | 150.02 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-06 |
2000-05-31 | 123812 | 48037 | 73491 | 75069 | 63961 | http://landregistry.data.gov.uk/data/hpi/regio... | 148.12 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-05 |
Having got lots of data as a time series, it would be quite nice to be able to plot the values as a line chart. A good library for going this is ggplot
. To make things easier when using ggplot
, let's reassign the index column as a non-index column.
df.reset_index(inplace=True)
df[:3]
dt | avgDetached | avgFlats | avgPriceAll | avgSemi | avgTerraced | index | indexr | region | regionName | yearmonth | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000-09-30 | 131372 | 50969 | 77978 | 79652 | 67866 | http://landregistry.data.gov.uk/data/hpi/regio... | 157.17 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-09 |
1 | 2000-06-30 | 125395 | 48650 | 74430 | 76028 | 64779 | http://landregistry.data.gov.uk/data/hpi/regio... | 150.02 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-06 |
2 | 2000-05-31 | 123812 | 48037 | 73491 | 75069 | 63961 | http://landregistry.data.gov.uk/data/hpi/regio... | 148.12 | http://landregistry.data.gov.uk/id/region/isle... | Isle of Wight | 2000-05 |
Import the ggplot
library.
#!pip3 install git+https://github.com/tbicr/folium.git@fixed#folium
from ggplot import *
Let's do a test plot...
ggplot(df,aes(x='dt',y='avgDetached'))+geom_line()
<ggplot: (-9223363260180709344)>
One of the nice features of ggplot
is that we can easily generate charts using different coloured lines for data from different groups, if the data is in the right shape.
At the moment, we have data relating to the average price of different types of housing in different columns.
df.columns.values
array(['dt', 'avgDetached', 'avgFlats', 'avgPriceAll', 'avgSemi', 'avgTerraced', 'index', 'indexr', 'region', 'regionName', 'yearmonth'], dtype=object)
If we reshape the data from this wide format to a long format with a single value column and a column identifying the housing class associated with that average price, we can easily plot all the average prices from a single ggplot
command.
So let's reshape the data...
df_melt=pd.melt(df,value_vars=['avgDetached', 'avgFlats', 'avgPriceAll', 'avgSemi',
'avgTerraced'],id_vars='dt')
df_melt[:3]
dt | variable | value | |
---|---|---|---|
0 | 2000-09-30 | avgDetached | 131372 |
1 | 2000-06-30 | avgDetached | 125395 |
2 | 2000-05-31 | avgDetached | 123812 |
And now plotting it is a simple matter.
ggplot(df_melt,aes(x='dt',y='value',colour='variable'))+geom_line()
<ggplot: (8776673340371)>
So there we have it, a quick demo of how to get average house price time series data for a particular region from the Land Registry Linked Data platform, tidy it a little and reshape it using the Python pandas library, and then plot it using ggplot.
And if you want an interactive Javascript version of the chart? Simply call on mpld3 which re-renders matplotlib charts (which underwrite ggplot charts, for example), as interactive d3.js charts...
#http://mpld3.github.io/
#!pip3 install mpld3
import mpld3
#Running the following command means charts generated in this notebook henceforth will be rendered using D3.js
mpld3.enable_notebook()
ggplot(df_melt,aes(x='dt',y='value',colour='variable'))+geom_line()
<ggplot: (-9223363260181667503)>
If you hover over the chart (in the live notebook at least) you should see an interactive toolbar pop up bottom left that will let you zoom in on, and pan over, the chart.
Magic, eh?