Built-Up Area Leafiness analysis

by Robin Wilson (robin@rtwilson.com)

Import relevant libraries

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Read the CSV file that was exported from the GIS data using QGIS.

In [2]:

df = pd.read_csv('BuiltUpAreas_WithLeaf_WithArea.csv')

In [3]:

df.columns

Out[3]:

Index(['bua11cd', 'bua11nm', 'bua_id', 'has_sd', 'sd_count', 'label', 'name',
       'Leaf_mean', 'Leaf_media', 'Leaf_stdev', 'Leaf_min', 'Leaf_max', 'area',
       'perimeter'],
      dtype='object')

How many rows are there to start with?

In [4]:

len(df)

Out[4]:

How many rows are left if we exclude BUAs under 10 km^2?

In [5]:

len(df[df.area > 10000000])

Out[5]:

OK, let's do the rest of our analysis with just these large areas.

In [6]:

large = df[df.area > 10000000]
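
The threshold of 10000000 is 10 km^2 written in square metres (1 km^2 = 1,000,000 m^2). This assumes the area column is in square metres, which the comparison against 10 km^2 implies but the CSV itself does not state. A minimal sketch of the same filter with a named constant, using a tiny made-up DataFrame (the names `M2_PER_KM2`, `MIN_AREA_M2` and `toy` are illustrative, not part of the original notebook):

```python
import pandas as pd

M2_PER_KM2 = 1_000_000            # 1 km = 1000 m, so 1 km^2 = 1000 * 1000 m^2
MIN_AREA_M2 = 10 * M2_PER_KM2     # 10 km^2 expressed in square metres

# Hypothetical toy data, not taken from the real CSV
toy = pd.DataFrame({
    "name": ["Small BUA", "Large BUA"],
    "area": [2.5e6, 1.2e7],       # assumed to be in square metres
})

# Keep only the rows whose area exceeds the threshold
large_toy = toy[toy["area"] > MIN_AREA_M2]
print(large_toy)
```

Naming the constant keeps the unit conversion visible, which makes a misplaced zero easier to spot than in the bare literal.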

What are the top areas if we sort by mean leafiness?

In [7]:

large.sort_values("Leaf_mean", ascending=False).head()

Out[7]:

	bua11cd	bua11nm	bua_id	has_sd	sd_count	label	name	Leaf_mean	Leaf_media	Leaf_stdev	Leaf_min	Leaf_max	area	perimeter
251	E34004906	Winchester BUA	4605	Y	2	E34004906	Winchester BUA	0.232308	0.225365	0.091031	0.006093	0.579025	1.232749e+07	52200.004
4763	E34004756	Northwich BUA	5080	Y	2	E34004756	Northwich BUA	0.181535	0.176116	0.061717	-0.021819	0.478769	1.441000e+07	104700.214
4629	E34004560	Maidenhead BUA	2820	Y	5	E34004560	Maidenhead BUA	0.181376	0.175004	0.075712	-0.056126	0.528889	1.853756e+07	93600.066
16	E34001481	Heswall BUA	724	N	0	E34001481	Heswall BUA	0.167553	0.163465	0.047651	0.037823	0.393085	1.017750e+07	42999.946
257	E34004941	Worcester BUA	1265	Y	2	E34004941	Worcester BUA	0.160467	0.155388	0.048406	-0.008730	0.441869	2.466749e+07	96100.084

What are the top areas if we sort by median leafiness?

In [8]:

large.sort_values("Leaf_media", ascending=False).head()

Out[8]:

	bua11cd	bua11nm	bua_id	has_sd	sd_count	label	name	Leaf_mean	Leaf_media	Leaf_stdev	Leaf_min	Leaf_max	area	perimeter
251	E34004906	Winchester BUA	4605	Y	2	E34004906	Winchester BUA	0.232308	0.225365	0.091031	0.006093	0.579025	1.232749e+07	52200.004
4763	E34004756	Northwich BUA	5080	Y	2	E34004756	Northwich BUA	0.181535	0.176116	0.061717	-0.021819	0.478769	1.441000e+07	104700.214
4629	E34004560	Maidenhead BUA	2820	Y	5	E34004560	Maidenhead BUA	0.181376	0.175004	0.075712	-0.056126	0.528889	1.853756e+07	93600.066
16	E34001481	Heswall BUA	724	N	0	E34001481	Heswall BUA	0.167553	0.163465	0.047651	0.037823	0.393085	1.017750e+07	42999.946
218	E34004294	Great Malvern BUA	1268	N	0	E34004294	Great Malvern BUA	0.159090	0.157340	0.046503	-0.017508	0.425216	1.352000e+07	86700.022
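
The two top-five lists above differ in only one place. A small sketch of how that comparison could be made programmatically, using the mean and median values copied from the two tables (the variable names are illustrative; on the real data `rows` would be replaced by `large`):

```python
import pandas as pd

# Rows copied from the two tables above (mean and median leafiness only)
rows = pd.DataFrame({
    "name": ["Winchester BUA", "Northwich BUA", "Maidenhead BUA",
             "Heswall BUA", "Worcester BUA", "Great Malvern BUA"],
    "Leaf_mean": [0.232308, 0.181535, 0.181376, 0.167553, 0.160467, 0.159090],
    "Leaf_media": [0.225365, 0.176116, 0.175004, 0.163465, 0.155388, 0.157340],
})

# nlargest(n, column) returns the n rows with the highest values in column
top_by_mean = set(rows.nlargest(5, "Leaf_mean")["name"])
top_by_median = set(rows.nlargest(5, "Leaf_media")["name"])

# Areas that appear in one top five but not the other
only_mean = top_by_mean - top_by_median
only_median = top_by_median - top_by_mean
print(only_mean, only_median)
```

With these six rows, Worcester appears only in the mean ranking and Great Malvern only in the median ranking; the other four areas are common to both.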

Which areas come out lowest if we sort by mean leafiness?

In [9]:

large.sort_values("Leaf_mean", ascending=True).head()

Out[9]:

	bua11cd	bua11nm	bua_id	has_sd	sd_count	label	name	Leaf_mean	Leaf_media	Leaf_stdev	Leaf_min	Leaf_max	area	perimeter
4911	E34004978	Grays BUA	5708	Y	4	E34004978	Grays BUA	0.046586	0.045719	0.025500	-0.040814	0.195052	2.635254e+07	1.115000e+05
4744	E34004846	Thanet BUA	4243	Y	3	E34004846	Thanet BUA	0.050020	0.043096	0.045293	-0.166400	0.347691	2.789248e+07	9.520007e+04
142	E34004707	Greater London BUA	5705	Y	104	E34004707	Greater London BUA	0.061683	0.057584	0.036802	-0.095982	0.583697	1.737855e+09	3.256799e+06
4708	E34004682	Stevenage BUA	3252	Y	2	E34004682	Stevenage BUA	0.063386	0.062211	0.021998	-0.017950	0.178562	2.189499e+07	5.230007e+04
4833	E34004858	Exeter BUA	4	Y	3	E34004858	Exeter BUA	0.066929	0.062454	0.031770	-0.025321	0.275862	2.849248e+07	1.078999e+05

Which areas have the most variability in their leafiness? Mean leafiness differs a lot between areas, so we can't compare standard deviations directly: the same standard deviation represents far more relative spread for an area with a low mean than for one with a high mean. Instead, we'll calculate the coefficient of variation (standard deviation as a proportion of the mean) and rank the areas by that.

In [10]:

large['Leaf_cv'] = large.Leaf_stdev / large.Leaf_mean

/Users/robin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

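The SettingWithCopyWarning appears because `large` was created by filtering `df`, so pandas cannot tell whether the assignment was meant to reach the original frame. The new column is still added to `large`, so the results that follow are unaffected. One common way to avoid the warning is to take an explicit copy when filtering. A minimal sketch, using the rounded values shown for Thanet and Bath elsewhere in this notebook (`df_toy` and `large_toy` are illustrative names):

```python
import pandas as pd

# Two rows with values copied (rounded) from the tables in this notebook
df_toy = pd.DataFrame({
    "name": ["Thanet BUA", "Bath BUA"],
    "Leaf_mean": [0.050020, 0.129506],
    "Leaf_stdev": [0.045293, 0.027902],
    "area": [2.789248e+07, 2.423750e+07],
})

# .copy() makes the filtered subset an independent DataFrame,
# so adding a column to it no longer triggers the warning
large_toy = df_toy[df_toy["area"] > 10000000].copy()

# Coefficient of variation = standard deviation / mean
large_toy["Leaf_cv"] = large_toy["Leaf_stdev"] / large_toy["Leaf_mean"]
print(large_toy[["name", "Leaf_cv"]])
```

Because the inputs here are rounded to six decimal places, the computed values can differ from the notebook's own Leaf_cv column in the last digit or two.
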
In [11]:

large.sort_values("Leaf_cv", ascending=False).head()

Out[11]:

	bua11cd	bua11nm	bua_id	has_sd	sd_count	label	name	Leaf_mean	Leaf_media	Leaf_stdev	Leaf_min	Leaf_max	area	perimeter	Leaf_cv
4744	E34004846	Thanet BUA	4243	Y	3	E34004846	Thanet BUA	0.050020	0.043096	0.045293	-0.166400	0.347691	2.789248e+07	9.520007e+04	0.905490
308	E34004644	Felixstowe BUA	4247	Y	2	E34004644	Felixstowe BUA	0.088974	0.079795	0.059855	-0.086180	0.437179	1.206250e+07	4.910002e+04	0.672721
306	E34004640	Reading BUA	5244	Y	8	E34004640	Reading BUA	0.086041	0.076874	0.055083	-0.093819	0.477467	8.369745e+07	2.895002e+05	0.640186
4866	E34004900	Blackpool BUA	707	Y	7	E34004900	Blackpool BUA	0.076584	0.070172	0.047481	-0.052320	0.376961	6.126501e+07	1.720999e+05	0.619988
81	E34005054	Greater Manchester BUA	5071	Y	72	E34005054	Greater Manchester BUA	0.097402	0.084551	0.059387	-0.151687	0.581677	6.302525e+08	1.630301e+06	0.609703

And which areas have the least variability?

In [12]:

large.sort_values('Leaf_cv', ascending=True).head()

Out[12]:

	bua11cd	bua11nm	bua_id	has_sd	sd_count	label	name	Leaf_mean	Leaf_media	Leaf_stdev	Leaf_min	Leaf_max	area	perimeter	Leaf_cv
70	E34004403	Bath BUA	1266	N	0	E34004403	Bath BUA	0.129506	0.126995	0.027902	0.031190	0.307061	2.423750e+07	92099.940000	0.215446
22	E34001962	Yeovil BUA	722	N	0	E34001962	Yeovil BUA	0.124182	0.122399	0.028439	0.023708	0.302564	1.256501e+07	42000.094000	0.229008
4653	E34004587	Stafford BUA	4492	Y	3	E34004587	Stafford BUA	0.150732	0.148805	0.037492	0.026252	0.365269	2.045752e+07	90700.098005	0.248732
230	E34005036	York BUA	2279	Y	2	E34005036	York BUA	0.105797	0.104355	0.026749	0.004041	0.260249	3.402500e+07	118499.958002	0.252833
175	E34004487	Durham BUA	5436	N	0	E34004487	Durham BUA	0.147859	0.144863	0.039650	0.011482	0.320897	1.286749e+07	50299.956000	0.268158

Now let's have a look at this on a graph...

In [13]:

from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool


def scatter_with_hover(df, x, y,
                       fig=None, cols=None, name=None, marker='x',
                       fig_width=500, fig_height=500, **kwargs):
    """
    Plots an interactive scatter plot of `x` vs `y` using bokeh, with automatic
    tooltips showing columns from `df`.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the data to be plotted
    x : str
        Name of the column to use for the x-axis values
    y : str
        Name of the column to use for the y-axis values
    fig : bokeh.plotting.Figure, optional
        Figure on which to plot (if not given then a new figure will be created)
    cols : list of str
        Columns to show in the hover tooltip (default is to show all)
    name : str
        Bokeh series name to give to the scattered data
    marker : str
        Name of marker to use for scatter plot
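    fig_width : int
        Width of the figure, used only when a new figure is created
    fig_height : int
        Height of the figure, used only when a new figure is created
    **kwargs
        Any further arguments to be passed to fig.scatter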

    Returns
    -------
    bokeh.plotting.Figure
        Figure (the same as given, or the newly created figure)

    Example
    -------
    fig = scatter_with_hover(df, 'A', 'B')
    show(fig)

    fig = scatter_with_hover(df, 'A', 'B', cols=['C', 'D', 'E'], marker='x', color='red')
    show(fig)

    Author
    ------
    Robin Wilson <robin@rtwilson.com>
    with thanks to Max Albert for original code example
    """

    # If we haven't been given a Figure obj then create it with default
    # size etc.
    if fig is None:
        fig = figure(width=fig_width, height=fig_height, tools=['box_zoom', 'reset'])

    # We're getting data from the given dataframe
    source = ColumnDataSource(data=df)

    # We need a name so that we can restrict hover tools to just this
    # particular 'series' on the plot. You can specify it (in case it
    # needs to be something specific for other reasons), otherwise
    # we just use 'main'
    if name is None:
        name = 'main'

    # Actually do the scatter plot - the easy bit
    # (other keyword arguments will be passed to this function)
    renderer = fig.scatter(x, y, source=source, name=name, marker=marker, **kwargs)

    # Now we create the hover tool, and make sure it is only active with
    # the series we plotted in the previous line. We pass the renderer
    # itself rather than its name, because newer Bokeh releases no longer
    # accept the `names` argument.
    hover = HoverTool(renderers=[renderer])

    if cols is None:
        # Display *all* columns in the tooltips
        hover.tooltips = [(c, '@' + c) for c in df.columns]
    else:
        # Display just the given columns in the tooltips
        hover.tooltips = [(c, '@' + c) for c in cols]

    #hover.tooltips.append(('index', '$index'))

    # Finally add/enable the tool
    fig.add_tools(hover)

    return fig
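
One limitation of building tooltips as '@' + c is that Bokeh only reads a bare @name up to the first space, so a column called, say, "Leaf mean" would not display. Bokeh's tooltip syntax also accepts the brace form @{column name}, which handles spaces. A minimal sketch of a helper that always uses the brace form (`make_tooltips` is a hypothetical name, not part of the function above or of Bokeh):

```python
def make_tooltips(columns):
    """Build (label, field) pairs for a Bokeh HoverTool (hypothetical helper)."""
    # '@{name}' is the brace form of '@name'; it is needed when the
    # column name contains spaces and is harmless otherwise
    return [(c, '@{' + c + '}') for c in columns]


print(make_tooltips(['name', 'Leaf mean']))
```

None of the columns in this dataset contain spaces, so the original concatenation works here; the brace form only matters if the function is reused on other data.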

In [14]:

fig = scatter_with_hover(large, 'Leaf_mean', 'Leaf_cv', cols=['name'])
fig.xaxis.axis_label = "Leafiness Mean"
fig.yaxis.axis_label = "Leafiness CV"

In [15]:

from bokeh.io import output_notebook
from bokeh.plotting import show

In [16]:

output_notebook()

Loading BokehJS ...

In [17]:

show(fig)
