Pooled Classification

A common workflow with longitudinal spatial data is to apply the same classification scheme to an attribute over different time periods. More specifically, one would like to keep the class breaks the same over each period and examine how the mass of the distribution changes over these classes in the different periods.

The Pooled classifier supports this workflow.

In [44]:

import numpy as np
import mapclassify as mc

Sample Data

We construct a synthetic dataset of 20 cross-sectional units observed at three time points. The mean of the series increases by 20 from one period to the next.

In [45]:

n = 20
data = np.array([np.arange(n)+i*n for i in range(1,4)]).T

In [46]:

data.shape

Out[46]:

(20, 3)

In [47]:

data

Out[47]:

array([[20, 40, 60],
       [21, 41, 61],
       [22, 42, 62],
       [23, 43, 63],
       [24, 44, 64],
       [25, 45, 65],
       [26, 46, 66],
       [27, 47, 67],
       [28, 48, 68],
       [29, 49, 69],
       [30, 50, 70],
       [31, 51, 71],
       [32, 52, 72],
       [33, 53, 73],
       [34, 54, 74],
       [35, 55, 75],
       [36, 56, 76],
       [37, 57, 77],
       [38, 58, 78],
       [39, 59, 79]])

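Default: Quintiles

By default, Pooled applies a vec operator to the data matrix, stacking its columns into a single vector, and treats all observations as one collection. The default classifier is Quantiles with five classes, so the class breaks here are the quintiles of the pooled data.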

In [48]:

res = mc.Pooled(data)

In [49]:

res

Out[49]:

Pooled Classifier

Pooled Quantiles      

   Interval      Count
----------------------
[20.00, 31.80] |    12
(31.80, 43.60] |     8
(43.60, 55.40] |     0
(55.40, 67.20] |     0
(67.20, 79.00] |     0

Pooled Quantiles      

   Interval      Count
----------------------
( -inf, 31.80] |     0
(31.80, 43.60] |     4
(43.60, 55.40] |    12
(55.40, 67.20] |     4
(67.20, 79.00] |     0

Pooled Quantiles      

   Interval      Count
----------------------
( -inf, 31.80] |     0
(31.80, 43.60] |     0
(43.60, 55.40] |     0
(55.40, 67.20] |     8
(67.20, 79.00] |    12

Note that the class definitions are identical across periods, with the exception of the lower bound of the first class. Because the first period contains the minimum value of the pooled series, that value defines the closed lower bound of the first class in that period. In the second and third periods, the local minima are greater than the upper bound of the first class, so no observations from those periods fall in it. Following the policy in mapclassify, the lower bound of the first class is set to -inf in those periods to indicate that their minimum values are not contained in the first class.
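The pooling step can be illustrated with plain NumPy. The following is a minimal sketch, not the mapclassify implementation: it assumes the breaks are the linearly interpolated quintiles of the flattened matrix, and that a value belongs to the first class whose upper bound is greater than or equal to it. Under those assumptions it reproduces the breaks and per-period counts shown above.

```python
import numpy as np

# Rebuild the synthetic panel: 20 units (rows) by 3 periods (columns).
n = 20
data = np.array([np.arange(n) + i * n for i in range(1, 4)]).T

# "vec" the matrix: stack all observations into one pooled vector.
pooled = data.ravel()

# Upper bounds of the five pooled quintile classes.
breaks = np.percentile(pooled, [20, 40, 60, 80, 100])

# Assign each value to the first class whose upper bound is >= the value,
# then tabulate the class counts separately for each period.
counts = np.array([
    np.bincount(np.searchsorted(breaks, data[:, t], side="left"), minlength=5)
    for t in range(data.shape[1])
])

print(breaks)
print(counts)
```

Each row of `counts` corresponds to one period, so reading down the rows shows the mass of the distribution moving from the lower classes to the upper classes.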

In [50]:

res = mc.Pooled(data, k=4)

In [51]:

res.col_classifiers[0].counts

Out[51]:

array([15,  5,  0,  0])

In [52]:

res.col_classifiers[-1].counts

Out[52]:

array([ 0,  0,  5, 15])

In [53]:

res.global_classifier.counts

Out[53]:

array([15, 15, 15, 15])

In [54]:

res

Out[54]:

Pooled Classifier

Pooled Quantiles      

   Interval      Count
----------------------
[20.00, 34.75] |    15
(34.75, 49.50] |     5
(49.50, 64.25] |     0
(64.25, 79.00] |     0

Pooled Quantiles      

   Interval      Count
----------------------
( -inf, 34.75] |     0
(34.75, 49.50] |    10
(49.50, 64.25] |    10
(64.25, 79.00] |     0

Pooled Quantiles      

   Interval      Count
----------------------
( -inf, 34.75] |     0
(34.75, 49.50] |     0
(49.50, 64.25] |     5
(64.25, 79.00] |    15

Extract the pooled classifier for each column:

In [55]:

c0, c1, c2 = res.col_classifiers

In [56]:

c0

Out[56]:

Pooled Quantiles      

   Interval      Count
----------------------
[20.00, 34.75] |    15
(34.75, 49.50] |     5
(49.50, 64.25] |     0
(64.25, 79.00] |     0

Compare this with the unrestricted classifier for the first column:

In [57]:

mc.Quantiles(c0.y, k=4)

Out[57]:

Quantiles             

   Interval      Count
----------------------
[20.00, 24.75] |     5
(24.75, 29.50] |     5
(29.50, 34.25] |     5
(34.25, 39.00] |     5

and the same comparison for the last column:

In [58]:

c2

Out[58]:

Pooled Quantiles      

   Interval      Count
----------------------
( -inf, 34.75] |     0
(34.75, 49.50] |     0
(49.50, 64.25] |     5
(64.25, 79.00] |    15

In [59]:

mc.Quantiles(c2.y, k=4)

Out[59]:

Quantiles             

   Interval      Count
----------------------
[60.00, 64.75] |     5
(64.75, 69.50] |     5
(69.50, 74.25] |     5
(74.25, 79.00] |     5
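The contrast between pooled and period-specific breaks can be sketched in NumPy. This is an illustrative sketch under the assumption that quartile breaks are computed by linear interpolation; `class_counts` is a hypothetical helper, not part of mapclassify.

```python
import numpy as np

n = 20
data = np.array([np.arange(n) + i * n for i in range(1, 4)]).T

def class_counts(values, breaks):
    """Count the values in each class, where breaks are class upper bounds."""
    idx = np.searchsorted(breaks, values, side="left")
    return np.bincount(idx, minlength=len(breaks)).tolist()

q = [25, 50, 75, 100]
pooled_breaks = np.percentile(data, q)  # one set of breaks for all periods

pooled_counts, own_counts = [], []
for t in range(data.shape[1]):
    col = data[:, t]
    own_breaks = np.percentile(col, q)  # breaks recomputed for this period
    pooled_counts.append(class_counts(col, pooled_breaks))
    own_counts.append(class_counts(col, own_breaks))

print(pooled_counts)
print(own_counts)
```

With period-specific breaks every period has five observations per class, so the upward shift of the series is invisible. With the pooled breaks the counts move from the lowest classes to the highest ones.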

Non-default classifier: BoxPlot

In [60]:

res = mc.Pooled(data, classifier='BoxPlot', hinge=1.5)

In [61]:

res

Out[61]:

Pooled Classifier

Pooled BoxPlot          

    Interval       Count
------------------------
(  -inf,  -9.50] |     0
( -9.50,  34.75] |    15
( 34.75,  49.50] |     5
( 49.50,  64.25] |     0
( 64.25, 108.50] |     0

Pooled BoxPlot          

    Interval       Count
------------------------
(  -inf,  -9.50] |     0
( -9.50,  34.75] |     0
( 34.75,  49.50] |    10
( 49.50,  64.25] |    10
( 64.25, 108.50] |     0

Pooled BoxPlot          

    Interval       Count
------------------------
(  -inf,  -9.50] |     0
( -9.50,  34.75] |     0
( 34.75,  49.50] |     0
( 49.50,  64.25] |     5
( 64.25, 108.50] |    15

In [62]:

res.col_classifiers[0].bins

Out[62]:

array([ -9.5 ,  34.75,  49.5 ,  64.25, 108.5 ])

In [63]:

c0, c1, c2 = res.col_classifiers

In [64]:

c0.yb

Out[64]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

In [65]:

c00 = mc.BoxPlot(c0.y, hinge=3)

In [66]:

c00.yb

Out[66]:

array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4])

In [67]:

c00

Out[67]:

BoxPlot               

   Interval      Count
----------------------
( -inf, -3.75] |     0
(-3.75, 24.75] |     5
(24.75, 29.50] |     5
(29.50, 34.25] |     5
(34.25, 62.75] |     5

In [68]:

c0

Out[68]:

Pooled BoxPlot          

    Interval       Count
------------------------
(  -inf,  -9.50] |     0
( -9.50,  34.75] |    15
( 34.75,  49.50] |     5
( 49.50,  64.25] |     0
( 64.25, 108.50] |     0
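The breaks in the two box plot tables above are consistent with the usual fence construction: the quartiles and median, plus fences placed `hinge` interquartile ranges beyond the first and third quartiles. The sketch below assumes that construction and linear interpolation for the quartiles. `boxplot_breaks` is a hypothetical helper, not the mapclassify implementation.

```python
import numpy as np

n = 20
data = np.array([np.arange(n) + i * n for i in range(1, 4)]).T

def boxplot_breaks(values, hinge=1.5):
    """Return the lower fence, Q1, median, Q3, and upper fence of values."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return np.array([q1 - hinge * iqr, q1, median, q3, q3 + hinge * iqr])

# Pooled breaks: all 60 observations, hinge=1.5 as in the Pooled call.
pooled_breaks = boxplot_breaks(data, hinge=1.5)

# First-period breaks: 20 observations, hinge=3 as in the mc.BoxPlot call.
first_period_breaks = boxplot_breaks(data[:, 0], hinge=3)

print(pooled_breaks)
print(first_period_breaks)
```

Note that the two calls differ in both the data and the `hinge` value, so the difference between the fences reflects the larger pooled interquartile range as well as the different multiplier.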

Non-default classifier: FisherJenks

In [69]:

res = mc.Pooled(data, classifier='FisherJenks', k=5)

In [70]:

res

Out[70]:

Pooled Classifier

Pooled FisherJenks    

   Interval      Count
----------------------
[20.00, 31.00] |    12
(31.00, 43.00] |     8
(43.00, 55.00] |     0
(55.00, 67.00] |     0
(67.00, 79.00] |     0

Pooled FisherJenks    

   Interval      Count
----------------------
( -inf, 31.00] |     0
(31.00, 43.00] |     4
(43.00, 55.00] |    12
(55.00, 67.00] |     4
(67.00, 79.00] |     0

Pooled FisherJenks    

   Interval      Count
----------------------
( -inf, 31.00] |     0
(31.00, 43.00] |     0
(43.00, 55.00] |     0
(55.00, 67.00] |     8
(67.00, 79.00] |    12

In [71]:

c0, c1, c2 = res.col_classifiers
mc.FisherJenks(c0.y, k=5)

Out[71]:

FisherJenks           

   Interval      Count
----------------------
[20.00, 23.00] |     4
(23.00, 27.00] |     4
(27.00, 31.00] |     4
(31.00, 35.00] |     4
(35.00, 39.00] |     4

Non-default classifier: MaximumBreaks

In [72]:

data[1, 0] = 10
data[1, 1] = 10
data[1, 2] = 10
data[9, 2] = 10
data

Out[72]:

array([[20, 40, 60],
       [10, 10, 10],
       [22, 42, 62],
       [23, 43, 63],
       [24, 44, 64],
       [25, 45, 65],
       [26, 46, 66],
       [27, 47, 67],
       [28, 48, 68],
       [29, 49, 10],
       [30, 50, 70],
       [31, 51, 71],
       [32, 52, 72],
       [33, 53, 73],
       [34, 54, 74],
       [35, 55, 75],
       [36, 56, 76],
       [37, 57, 77],
       [38, 58, 78],
       [39, 59, 79]])

In [73]:

res = mc.Pooled(data, classifier='MaximumBreaks', k=5)

In [74]:

res

Out[74]:

Pooled Classifier

Pooled MaximumBreaks  

   Interval      Count
----------------------
[10.00, 15.00] |     1
(15.00, 21.00] |     1
(21.00, 41.00] |    18
(41.00, 61.00] |     0
(61.00, 79.00] |     0

Pooled MaximumBreaks  

   Interval      Count
----------------------
[10.00, 15.00] |     1
(15.00, 21.00] |     0
(21.00, 41.00] |     1
(41.00, 61.00] |    18
(61.00, 79.00] |     0

Pooled MaximumBreaks  

   Interval      Count
----------------------
[10.00, 15.00] |     2
(15.00, 21.00] |     0
(21.00, 41.00] |     0
(41.00, 61.00] |     1
(61.00, 79.00] |    17

In [75]:

c0, c1, c2 = res.col_classifiers

In [76]:

c0

Out[76]:

Pooled MaximumBreaks  

   Interval      Count
----------------------
[10.00, 15.00] |     1
(15.00, 21.00] |     1
(21.00, 41.00] |    18
(41.00, 61.00] |     0
(61.00, 79.00] |     0

In [77]:

mc.MaximumBreaks(c0.y, k=5)

Insufficient number of unique diffs. Breaks are random.

Out[77]:

MaximumBreaks         

   Interval      Count
----------------------
[10.00, 15.00] |     1
(15.00, 21.00] |     1
(21.00, 22.50] |     1
(22.50, 28.50] |     6
(28.50, 39.00] |    11

In [78]:

res = mc.Pooled(data, classifier='UserDefined', bins=mc.Quantiles(data[:,-1]).bins)

In [79]:

res

Out[79]:

Pooled Classifier

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 62.80] |    20
(62.80, 66.60] |     0
(66.60, 71.40] |     0
(71.40, 75.20] |     0
(75.20, 79.00] |     0

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 62.80] |    20
(62.80, 66.60] |     0
(66.60, 71.40] |     0
(71.40, 75.20] |     0
(75.20, 79.00] |     0

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 62.80] |     4
(62.80, 66.60] |     4
(66.60, 71.40] |     4
(71.40, 75.20] |     4
(75.20, 79.00] |     4

In [80]:

mc.Quantiles(data[:,-1])

Out[80]:

Quantiles             

   Interval      Count
----------------------
[10.00, 62.80] |     4
(62.80, 66.60] |     4
(66.60, 71.40] |     4
(71.40, 75.20] |     4
(75.20, 79.00] |     4

In [81]:

data[:,-1]

Out[81]:

array([60, 10, 62, 63, 64, 65, 66, 67, 68, 10, 70, 71, 72, 73, 74, 75, 76,
       77, 78, 79])

Pinning the pooling

Another option is to use a single subperiod to define the classes for the pooled classifier.

Pinning to the last period

As an example, we can use the quintiles from the third period to define the pooled classifier:

In [82]:

pinned = mc.Pooled(data, classifier='UserDefined', bins=mc.Quantiles(data[:,-1]).bins)

In [83]:

pinned

Out[83]:

Pooled Classifier

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 62.80] |    20
(62.80, 66.60] |     0
(66.60, 71.40] |     0
(71.40, 75.20] |     0
(75.20, 79.00] |     0

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 62.80] |    20
(62.80, 66.60] |     0
(66.60, 71.40] |     0
(71.40, 75.20] |     0
(75.20, 79.00] |     0

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 62.80] |     4
(62.80, 66.60] |     4
(66.60, 71.40] |     4
(71.40, 75.20] |     4
(75.20, 79.00] |     4

In [84]:

pinned.global_classifier

Out[84]:

UserDefined           

   Interval      Count
----------------------
[10.00, 62.80] |    44
(62.80, 66.60] |     4
(66.60, 71.40] |     4
(71.40, 75.20] |     4
(75.20, 79.00] |     4

Pinning to the first period

In [85]:

pinned = mc.Pooled(data, classifier='UserDefined', bins=mc.Quantiles(data[:,0]).bins)

In [86]:

pinned

Out[86]:

Pooled Classifier

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 23.80] |     4
(23.80, 27.60] |     4
(27.60, 31.40] |     4
(31.40, 35.20] |     4
(35.20, 39.00] |     4
(39.00, 79.00] |     0

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 23.80] |     1
(23.80, 27.60] |     0
(27.60, 31.40] |     0
(31.40, 35.20] |     0
(35.20, 39.00] |     0
(39.00, 79.00] |    19

Pooled UserDefined    

   Interval      Count
----------------------
[10.00, 23.80] |     2
(23.80, 27.60] |     0
(27.60, 31.40] |     0
(31.40, 35.20] |     0
(35.20, 39.00] |     0
(39.00, 79.00] |    18

Note that the quintiles for the first period contain, by definition, all the values from that period, but they do not bound the larger values in the subsequent periods. Following the mapclassify policy, an additional class is added so that all values in the pooled series are contained in some class.
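The effect of pinning to the first period can be sketched with NumPy. This is a minimal illustration, not the mapclassify implementation: it assumes linearly interpolated quintiles and assumes the extra class is formed by appending the pooled maximum as a final upper bound. Under those assumptions it reproduces the counts shown above.

```python
import numpy as np

n = 20
data = np.array([np.arange(n) + i * n for i in range(1, 4)]).T
# Apply the same edits that were made before the MaximumBreaks example.
data[1, :] = 10
data[9, 2] = 10

# Pin the classes to the quintiles of the first period.
bins = np.percentile(data[:, 0], [20, 40, 60, 80, 100])

# Later periods exceed the last upper bound, so append the pooled maximum
# to make sure every pooled value falls in some class.
if data.max() > bins[-1]:
    bins = np.append(bins, data.max())

counts = np.array([
    np.bincount(np.searchsorted(bins, data[:, t], side="left"),
                minlength=len(bins))
    for t in range(data.shape[1])
])

print(bins)
print(counts)
```

Almost all observations from the second and third periods land in the appended class, which is what makes the pinned classification useful for showing how far the later distributions have moved beyond the first-period range.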