import pandas as pd
import numpy as np
import sklearn as sk
print 'pandas version: ',pd.__version__
print 'numpy version:',np.__version__
print 'sklearn version:',sk.__version__
pandas version: 0.13.1 numpy version: 1.8.1 sklearn version: 0.14.1
import sys
home_dir='/home/ubuntu/UCSD_BigData'
sys.path.append(home_dir+'/utils')
from find_waiting_flow import *
from AWS_keypair_management import *
In the previous section, we have computed the mean vector and covariance matrix for each partition and have performed the complete PCA for a single partition.
Our final task now is to run the PCA for all partitions and merge them, using the medium description length (MDL) as a measure.
The MDL criterion for when to merge two regions $1$ and $2$, to a new region $3$ is
$$n_1\cdot k_1+(k_1+1)\cdot(2\times 365)+n_2\cdot k_2+(k_2+1)\cdot(2\times 365) > n_3\cdot k_3+(k_3+1)\cdot(2\times 365)$$where $k_i$ is the number of required eigenvectors for region $i$ to explain 99% of the variance and $n_i$ are the number of measurements in region $i$.
In order to do that, we will proceed as follows:
import pickle
dfFinal = pickle.load(open('finalTable.pkl', 'rb'))
dfBounds = dfFinal.ix[:,['partitionID','lon_min','lon_max','lat_min','lat_max']]
dfBounds = dfBounds.drop_duplicates('partitionID').set_index('partitionID')
dfBounds.head()
lon_min | lon_max | lat_min | lat_max | |
---|---|---|---|---|
partitionID | ||||
7126 | 37.1499 | 47.7500 | 41.01655 | 42.05000 |
3565 | 47.7500 | 69.2335 | 40.63300 | 42.05000 |
7125 | 37.1499 | 47.7500 | 40.70850 | 41.01655 |
3563 | 47.7500 | 71.5250 | 39.43350 | 40.63300 |
3560 | 44.1250 | 47.7500 | 39.43350 | 40.70850 |
5 rows × 4 columns
Two partitions are neighbours if
The following generator checks these two criterion for a given partition with respect to all other partitions and returns the neighbors.
def find_neighbors(partitionID):
bounds0 = np.array(dfBounds.ix[partitionID,:])
for p in dfBounds.index.values:
bounds = np.array(dfBounds.ix[p,['lon_min','lon_max','lat_min','lat_max']])
#Check whether the two partitions share a median
sharedlon = any([i in bounds0[:2] for i in bounds[:2]])
sharedlat = any([i in bounds0[2:] for i in bounds[2:]])
if sharedlon:
#Check whether the bounds of two partitions overlap
if bounds[2]<bounds0[2]<bounds[3] or bounds[2]<bounds0[3]<bounds[3] \
or bounds0[2]<bounds[2]<bounds0[3] or bounds0[2]<bounds[3]<bounds0[3]:
yield int(p)
elif sharedlat:
#Check whether the bounds of two partitions overlap
if bounds[0]<bounds0[0]<bounds[1] or bounds[0]<bounds0[1]<bounds[1] \
or bounds0[0]<bounds[0]<bounds0[1] or bounds0[0]<bounds[1]<bounds0[1]:
yield int(p)
To provide an example, for partition 3560, we get the neighbors
neighbours = list(find_neighbors(3560))
print neighbours
[3565, 7125, 3563]
With the help of this generator, we want to build a graph. We will take advantage of the dictionary structure in Python to save all neighbors for all partitions.
partitions = dfBounds.index.values
print partitions
[7126.0 3565.0 7125.0 ..., 3847.0 7693.0 3841.0]
%%time
g = dict()
i = 0
for p in partitions:
if i%100==0:
print i
g[int(p)] = list(find_neighbors(p))
i = i+1
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 CPU times: user 1h 5min 31s, sys: 2.27 s, total: 1h 5min 33s Wall time: 1h 16min 6s
# Save partition graph as pickle file
with open('partition_graph.pkl','wb') as file:
pickle.dump(g,file, pickle.HIGHEST_PROTOCOL)
g = pickle.load(open('partition_graph.pkl', 'rb'))
print g.items()[90:100]
[(2135, [2129, 2124, 2118, 2137]), (2136, [2138, 2139, 2141]), (2137, [2135, 2124, 2126]), (2138, [2141, 2136]), (2139, [2141, 2136]), (2140, [2337, 2335, 2134, 2142]), (2141, [2138, 2139, 2136]), (2142, [2343, 2337, 2140, 2164]), (2143, [2078, 2121, 2076, 2102, 2100, 2145]), (2144, [2149, 2147, 2146])]
Let us recall that the covariance matrix of a $M\times N$ matrix $X$ can be computed via
$$ Cov(X) = \frac{1}{N}\sum^N_{i=1} (\boldsymbol{x_i}-\boldsymbol{\mu}) (\boldsymbol{x_i}-\boldsymbol{\mu})^T = \frac{1}{N} \sum^N_{i=1} \boldsymbol{x_i}\boldsymbol{x_i}^T - \boldsymbol{\mu}\boldsymbol{\mu}^T $$where $\boldsymbol{x_i}, i = 1,...,N$ are the column vectors of matrix $X$ and
$$\boldsymbol{\mu}=\frac{1}{N}\sum^N_{i=1} \boldsymbol{x_i}$$is the mean vector of all column vectors of matrix $X$.
Our goal now is to calculate the covariance matrix of a merged region given that we know the covariance matrices and mean vectors of the partitions to be merged.
Suppose we are given two matrices $X$ and $Y$ with dimension $M\times N_x$ and $M\times N_y$, respectively. Furthermore, we know their mean vectors $\boldsymbol{\mu_x}, \boldsymbol{\mu_y}$ , as well as their covariances $Cov(X)$ and $Cov(Y)$. Now, consider the concatenated matrix $Z = [X,Y]$ with dimension $M\times N_z$, where $N_z=N_x+N_y$. In our case, $Z$ would represent the measurements of the merged region.
The covariance of $Z$ can then be expressed as
$$ Cov(Z) = \frac{1}{N_z} \sum^{N_z}_{i=1} \boldsymbol{z_i}\boldsymbol{z_i}^T - \boldsymbol{\mu_z}\boldsymbol{\mu_z}^T = \frac{1}{N_x+N_y} [ \sum^{N_x}_{i=1} \boldsymbol{x_i}\boldsymbol{x_i}^T + \sum^{N_y}_{i=1} \boldsymbol{y_i}\boldsymbol{y_i}^T ] - \boldsymbol{\mu_z}\boldsymbol{\mu_z}^T $$$$ = \frac{1}{N_x+N_y} [ N_x(Cov(X)+\boldsymbol{\mu_x}\boldsymbol{\mu_x}^T) + Ny(Cov(Y)+\boldsymbol{\mu_y}\boldsymbol{\mu_y}^T)] - \boldsymbol{\mu_z}\boldsymbol{\mu_z}^T $$where
$$ \boldsymbol{\mu_z}=\frac{N_x\boldsymbol{\mu_x}+N_y\boldsymbol{\mu_y}}{N_x+N_y} $$is the mean vector of matrix $Z$.
dfCov = pickle.load(open('covTable.pkl', 'rb'))