by Robin Wilson (robin@rtwilson.com)
Import relevant libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Read the CSV data which was exported from the GIS data using QGIS
df = pd.read_csv('BuiltUpAreas_WithLeaf_WithArea.csv')
df.columns
Index(['bua11cd', 'bua11nm', 'bua_id', 'has_sd', 'sd_count', 'label', 'name', 'Leaf_mean', 'Leaf_media', 'Leaf_stdev', 'Leaf_min', 'Leaf_max', 'area', 'perimeter'], dtype='object')
How many rows are there to start with?
len(df)
5360
How many rows are there if we exclude built-up areas (BUAs) under 10 km^2?
len(df[df.area > 10000000])
161
OK, let's do the rest of our analysis with just these large areas.
large = df[df.area > 10000000]
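The threshold of 10000000 only corresponds to 10 km^2 if the `area` column is in square metres, which is an assumption implied by the filter rather than something stated in the data. A minimal sketch of the same filter with the unit conversion made explicit is below; the constant names and the third row of the sample data are hypothetical, while the first two areas are taken from the tables in this notebook.

```python
import pandas as pd

# Assumption: `area` is stored in square metres, which the 10,000,000
# threshold used in the notebook implies.
M2_PER_KM2 = 1000 ** 2
MIN_AREA_KM2 = 10
MIN_AREA_M2 = MIN_AREA_KM2 * M2_PER_KM2  # 10,000,000 m^2

# Two areas taken from the tables in this notebook, plus one hypothetical
# small area added purely for illustration.
sample = pd.DataFrame({
    "name": ["Winchester BUA", "Greater London BUA", "Hypothetical village"],
    "area": [1.232749e+07, 1.737855e+09, 2.5e+06],
})

# Convert to km^2 so the values are easier to read
sample["area_km2"] = sample["area"] / M2_PER_KM2

# Keep only areas above the threshold; .copy() gives an independent
# DataFrame that can have columns added to it later without warnings.
large_sample = sample[sample["area"] > MIN_AREA_M2].copy()
print(large_sample)
```

Naming the threshold in km^2 and converting once makes it easier to change the cut-off later without counting zeros.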
What are the top areas if we sort by mean leafiness?
large.sort_values("Leaf_mean", ascending=False).head()
| | bua11cd | bua11nm | bua_id | has_sd | sd_count | label | name | Leaf_mean | Leaf_media | Leaf_stdev | Leaf_min | Leaf_max | area | perimeter |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
251 | E34004906 | Winchester BUA | 4605 | Y | 2 | E34004906 | Winchester BUA | 0.232308 | 0.225365 | 0.091031 | 0.006093 | 0.579025 | 1.232749e+07 | 52200.004 |
4763 | E34004756 | Northwich BUA | 5080 | Y | 2 | E34004756 | Northwich BUA | 0.181535 | 0.176116 | 0.061717 | -0.021819 | 0.478769 | 1.441000e+07 | 104700.214 |
4629 | E34004560 | Maidenhead BUA | 2820 | Y | 5 | E34004560 | Maidenhead BUA | 0.181376 | 0.175004 | 0.075712 | -0.056126 | 0.528889 | 1.853756e+07 | 93600.066 |
16 | E34001481 | Heswall BUA | 724 | N | 0 | E34001481 | Heswall BUA | 0.167553 | 0.163465 | 0.047651 | 0.037823 | 0.393085 | 1.017750e+07 | 42999.946 |
257 | E34004941 | Worcester BUA | 1265 | Y | 2 | E34004941 | Worcester BUA | 0.160467 | 0.155388 | 0.048406 | -0.008730 | 0.441869 | 2.466749e+07 | 96100.084 |
What are the top areas if we sort by median leafiness?
large.sort_values("Leaf_media", ascending=False).head()
| | bua11cd | bua11nm | bua_id | has_sd | sd_count | label | name | Leaf_mean | Leaf_media | Leaf_stdev | Leaf_min | Leaf_max | area | perimeter |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
251 | E34004906 | Winchester BUA | 4605 | Y | 2 | E34004906 | Winchester BUA | 0.232308 | 0.225365 | 0.091031 | 0.006093 | 0.579025 | 1.232749e+07 | 52200.004 |
4763 | E34004756 | Northwich BUA | 5080 | Y | 2 | E34004756 | Northwich BUA | 0.181535 | 0.176116 | 0.061717 | -0.021819 | 0.478769 | 1.441000e+07 | 104700.214 |
4629 | E34004560 | Maidenhead BUA | 2820 | Y | 5 | E34004560 | Maidenhead BUA | 0.181376 | 0.175004 | 0.075712 | -0.056126 | 0.528889 | 1.853756e+07 | 93600.066 |
16 | E34001481 | Heswall BUA | 724 | N | 0 | E34001481 | Heswall BUA | 0.167553 | 0.163465 | 0.047651 | 0.037823 | 0.393085 | 1.017750e+07 | 42999.946 |
218 | E34004294 | Great Malvern BUA | 1268 | N | 0 | E34004294 | Great Malvern BUA | 0.159090 | 0.157340 | 0.046503 | -0.017508 | 0.425216 | 1.352000e+07 | 86700.022 |
What are the lowest areas if we sort by mean leafiness?
large.sort_values("Leaf_mean", ascending=True).head()
| | bua11cd | bua11nm | bua_id | has_sd | sd_count | label | name | Leaf_mean | Leaf_media | Leaf_stdev | Leaf_min | Leaf_max | area | perimeter |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4911 | E34004978 | Grays BUA | 5708 | Y | 4 | E34004978 | Grays BUA | 0.046586 | 0.045719 | 0.025500 | -0.040814 | 0.195052 | 2.635254e+07 | 1.115000e+05 |
4744 | E34004846 | Thanet BUA | 4243 | Y | 3 | E34004846 | Thanet BUA | 0.050020 | 0.043096 | 0.045293 | -0.166400 | 0.347691 | 2.789248e+07 | 9.520007e+04 |
142 | E34004707 | Greater London BUA | 5705 | Y | 104 | E34004707 | Greater London BUA | 0.061683 | 0.057584 | 0.036802 | -0.095982 | 0.583697 | 1.737855e+09 | 3.256799e+06 |
4708 | E34004682 | Stevenage BUA | 3252 | Y | 2 | E34004682 | Stevenage BUA | 0.063386 | 0.062211 | 0.021998 | -0.017950 | 0.178562 | 2.189499e+07 | 5.230007e+04 |
4833 | E34004858 | Exeter BUA | 4 | Y | 3 | E34004858 | Exeter BUA | 0.066929 | 0.062454 | 0.031770 | -0.025321 | 0.275862 | 2.849248e+07 | 1.078999e+05 |
What are the areas that have the most variability in their leafiness? Each area has a very different mean leafiness value, so we can't just compare standard deviation values. Instead, we'll calculate the coefficient of variation (standard deviation as a proportion of the mean) and rank the areas by that.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when a column is set on a slice of another DataFrame
large = large.assign(Leaf_cv=large.Leaf_stdev / large.Leaf_mean)
large.sort_values("Leaf_cv", ascending=False).head()
| | bua11cd | bua11nm | bua_id | has_sd | sd_count | label | name | Leaf_mean | Leaf_media | Leaf_stdev | Leaf_min | Leaf_max | area | perimeter | Leaf_cv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4744 | E34004846 | Thanet BUA | 4243 | Y | 3 | E34004846 | Thanet BUA | 0.050020 | 0.043096 | 0.045293 | -0.166400 | 0.347691 | 2.789248e+07 | 9.520007e+04 | 0.905490 |
308 | E34004644 | Felixstowe BUA | 4247 | Y | 2 | E34004644 | Felixstowe BUA | 0.088974 | 0.079795 | 0.059855 | -0.086180 | 0.437179 | 1.206250e+07 | 4.910002e+04 | 0.672721 |
306 | E34004640 | Reading BUA | 5244 | Y | 8 | E34004640 | Reading BUA | 0.086041 | 0.076874 | 0.055083 | -0.093819 | 0.477467 | 8.369745e+07 | 2.895002e+05 | 0.640186 |
4866 | E34004900 | Blackpool BUA | 707 | Y | 7 | E34004900 | Blackpool BUA | 0.076584 | 0.070172 | 0.047481 | -0.052320 | 0.376961 | 6.126501e+07 | 1.720999e+05 | 0.619988 |
81 | E34005054 | Greater Manchester BUA | 5071 | Y | 72 | E34005054 | Greater Manchester BUA | 0.097402 | 0.084551 | 0.059387 | -0.151687 | 0.581677 | 6.302525e+08 | 1.630301e+06 | 0.609703 |
And the least variability?
large.sort_values('Leaf_cv', ascending=True).head()
| | bua11cd | bua11nm | bua_id | has_sd | sd_count | label | name | Leaf_mean | Leaf_media | Leaf_stdev | Leaf_min | Leaf_max | area | perimeter | Leaf_cv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
70 | E34004403 | Bath BUA | 1266 | N | 0 | E34004403 | Bath BUA | 0.129506 | 0.126995 | 0.027902 | 0.031190 | 0.307061 | 2.423750e+07 | 92099.940000 | 0.215446 |
22 | E34001962 | Yeovil BUA | 722 | N | 0 | E34001962 | Yeovil BUA | 0.124182 | 0.122399 | 0.028439 | 0.023708 | 0.302564 | 1.256501e+07 | 42000.094000 | 0.229008 |
4653 | E34004587 | Stafford BUA | 4492 | Y | 3 | E34004587 | Stafford BUA | 0.150732 | 0.148805 | 0.037492 | 0.026252 | 0.365269 | 2.045752e+07 | 90700.098005 | 0.248732 |
230 | E34005036 | York BUA | 2279 | Y | 2 | E34005036 | York BUA | 0.105797 | 0.104355 | 0.026749 | 0.004041 | 0.260249 | 3.402500e+07 | 118499.958002 | 0.252833 |
175 | E34004487 | Durham BUA | 5436 | N | 0 | E34004487 | Durham BUA | 0.147859 | 0.144863 | 0.039650 | 0.011482 | 0.320897 | 1.286749e+07 | 50299.956000 | 0.268158 |
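The coefficient of variation is just `Leaf_stdev / Leaf_mean`, so it can be checked by hand against the tables. The sketch below recomputes it for the most and least variable areas, using the rounded mean and standard deviation values printed above. Because those inputs are rounded, the results may differ from the `Leaf_cv` column in the last decimal places.

```python
import pandas as pd

# Mean and standard deviation copied from the tables in this notebook
# (rounded to six decimal places, as printed).
rows = pd.DataFrame({
    "name": ["Thanet BUA", "Bath BUA"],
    "Leaf_mean": [0.050020, 0.129506],
    "Leaf_stdev": [0.045293, 0.027902],
})

# Coefficient of variation: standard deviation as a proportion of the mean
rows = rows.assign(Leaf_cv=rows["Leaf_stdev"] / rows["Leaf_mean"])

# Most variable first
print(rows.sort_values("Leaf_cv", ascending=False))
```

Thanet's standard deviation is almost as large as its mean, whereas Bath's is only about a fifth of its mean, which is why they sit at opposite ends of the ranking.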
Now let's have a look at this on a graph...
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool
def scatter_with_hover(df, x, y,
                       fig=None, cols=None, name=None, marker='x',
                       fig_width=500, fig_height=500, **kwargs):
    """
    Plots an interactive scatter plot of `x` vs `y` using bokeh, with automatic
    tooltips showing columns from `df`.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the data to be plotted
    x : str
        Name of the column to use for the x-axis values
    y : str
        Name of the column to use for the y-axis values
    fig : bokeh.plotting.Figure, optional
        Figure on which to plot (if not given then a new figure will be created)
    cols : list of str
        Columns to show in the hover tooltip (default is to show all)
    name : str
        Bokeh series name to give to the scattered data
    marker : str
        Name of marker to use for scatter plot
    **kwargs
        Any further arguments to be passed to fig.scatter

    Returns
    -------
    bokeh.plotting.Figure
        Figure (the same as given, or the newly created figure)

    Example
    -------
    fig = scatter_with_hover(df, 'A', 'B')
    show(fig)

    fig = scatter_with_hover(df, 'A', 'B', cols=['C', 'D', 'E'], marker='x', color='red')
    show(fig)

    Author
    ------
    Robin Wilson <robin@rtwilson.com>
    with thanks to Max Albert for original code example
    """
    # If we haven't been given a Figure obj then create it with default
    # size etc.
    if fig is None:
        fig = figure(width=fig_width, height=fig_height, tools=['box_zoom', 'reset'])

    # We're getting data from the given dataframe
    source = ColumnDataSource(data=df)

    # We need a name so that we can restrict hover tools to just this
    # particular 'series' on the plot. You can specify it (in case it
    # needs to be something specific for other reasons), otherwise
    # we just use 'main'
    if name is None:
        name = 'main'

    # Actually do the scatter plot - the easy bit
    # (other keyword arguments will be passed to this function)
    fig.scatter(x, y, source=source, name=name, marker=marker, **kwargs)

    # Now we create the hover tool, and make sure it is only active with
    # the series we plotted in the previous line
    hover = HoverTool(names=[name])

    if cols is None:
        # Display *all* columns in the tooltips
        hover.tooltips = [(c, '@' + c) for c in df.columns]
    else:
        # Display just the given columns in the tooltips
        hover.tooltips = [(c, '@' + c) for c in cols]

    #hover.tooltips.append(('index', '$index'))

    # Finally add/enable the tool
    fig.add_tools(hover)

    return fig
fig = scatter_with_hover(large, 'Leaf_mean', 'Leaf_cv', cols=['name'])
fig.xaxis.axis_label = "Leafiness Mean"
fig.yaxis.axis_label = "Leafiness CV"
from bokeh.io import output_notebook
from bokeh.plotting import show
# Tell Bokeh to render its output inline in the notebook
output_notebook()
show(fig)
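The hover tooltips in `scatter_with_hover` are a list of `(label, '@column')` tuples, where `@column` tells Bokeh to look up that column in the data source. As far as we understand the Bokeh documentation, a column name containing spaces has to be wrapped in braces, as in `@{column name}`, which the simple `'@' + c` form does not handle. A minimal sketch of a helper that covers both cases is below; `build_tooltips` is a hypothetical name and is not part of Bokeh or of the function above.

```python
def build_tooltips(columns):
    """Return a list of (label, field) tuples suitable for HoverTool.tooltips.

    Hypothetical helper for illustration; not part of Bokeh.
    """
    tooltips = []
    for col in columns:
        # Assumption: Bokeh needs braces around field names containing spaces
        if " " in col:
            field = "@{" + col + "}"
        else:
            field = "@" + col
        tooltips.append((col, field))
    return tooltips


# 'Leaf cv' (with a space) is a made-up column name to show the brace form
print(build_tooltips(["name", "Leaf_mean", "Leaf cv"]))
```

None of the columns in this dataset contain spaces, so the original list comprehension works fine here; the helper only matters if the function is reused on other DataFrames.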