Weather Analysis using MapReduce (Part 4)¶

In [1]:

import pandas as pd
import numpy as np
import sklearn as sk

print 'pandas version: ',pd.__version__
print 'numpy version:',np.__version__
print 'sklearn version:',sk.__version__

pandas version:  0.13.1
numpy version: 1.8.1
sklearn version: 0.14.1

In [2]:

import sys
home_dir='/home/ubuntu/UCSD_BigData'
sys.path.append(home_dir+'/utils')
from find_waiting_flow import *
from AWS_keypair_management import *

4. Merging partitions using medium description length (MDL)¶

In the previous section, we have computed the mean vector and covariance matrix for each partition and have performed the complete PCA for a single partition.

Our final task now is to run the PCA for all partitions and merge them, using the medium description length (MDL) as a measure.

The MDL criterion for when to merge two regions $1$ and $2$, to a new region $3$ is

$$n_1\cdot k_1+(k_1+1)\cdot(2\times 365)+n_2\cdot k_2+(k_2+1)\cdot(2\times 365) > n_3\cdot k_3+(k_3+1)\cdot(2\times 365)$$

where $k_i$ is the number of required eigenvectors for region $i$ to explain 99% of the variance and $n_i$ are the number of measurements in region $i$.

In order to do that, we will proceed as follows:

Find neighbors for each partition
Store neighbor information in a graph
Merge partitions based on MDL

Find neighbors for each partition¶

In [3]:

import pickle

dfFinal = pickle.load(open('finalTable.pkl', 'rb'))

dfBounds = dfFinal.ix[:,['partitionID','lon_min','lon_max','lat_min','lat_max']]
dfBounds = dfBounds.drop_duplicates('partitionID').set_index('partitionID')
dfBounds.head()

Out[3]:

	lon_min	lon_max	lat_min	lat_max
partitionID
7126	37.1499	47.7500	41.01655	42.05000
3565	47.7500	69.2335	40.63300	42.05000
7125	37.1499	47.7500	40.70850	41.01655
3563	47.7500	71.5250	39.43350	40.63300
3560	44.1250	47.7500	39.43350	40.70850

5 rows × 4 columns

Two partitions are neighbours if

they share at least one common longitude or latitude as a border
the shared borders overlap

The following generator checks these two criterion for a given partition with respect to all other partitions and returns the neighbors.

In [4]:

def find_neighbors(partitionID):
    bounds0 = np.array(dfBounds.ix[partitionID,:])
    for p in dfBounds.index.values:
        bounds = np.array(dfBounds.ix[p,['lon_min','lon_max','lat_min','lat_max']])
        #Check whether the two partitions share a median
        sharedlon = any([i in bounds0[:2] for i in bounds[:2]])
        sharedlat = any([i in bounds0[2:] for i in bounds[2:]])
        if sharedlon:
            #Check whether the bounds of two partitions overlap
            if bounds[2]<bounds0[2]<bounds[3] or bounds[2]<bounds0[3]<bounds[3] \
            or bounds0[2]<bounds[2]<bounds0[3] or bounds0[2]<bounds[3]<bounds0[3]:
                yield int(p)
            elif sharedlat: 
                #Check whether the bounds of two partitions overlap
                if bounds[0]<bounds0[0]<bounds[1] or bounds[0]<bounds0[1]<bounds[1] \
                or bounds0[0]<bounds[0]<bounds0[1] or bounds0[0]<bounds[1]<bounds0[1]:
                    yield int(p)

To provide an example, for partition 3560, we get the neighbors

In [5]:

neighbours = list(find_neighbors(3560))
print neighbours

[3565, 7125, 3563]

With the help of this generator, we want to build a graph. We will take advantage of the dictionary structure in Python to save all neighbors for all partitions.

In [6]:

partitions = dfBounds.index.values
print partitions

[7126.0 3565.0 7125.0 ..., 3847.0 7693.0 3841.0]

In [107]:

%%time

g = dict()
i = 0
for p in partitions:
    if i%100==0:
        print i
    g[int(p)] = list(find_neighbors(p))
    i = i+1

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
CPU times: user 1h 5min 31s, sys: 2.27 s, total: 1h 5min 33s
Wall time: 1h 16min 6s

In [108]:

# Save partition graph as pickle file
with open('partition_graph.pkl','wb') as file:
    pickle.dump(g,file, pickle.HIGHEST_PROTOCOL)

In [8]:

g = pickle.load(open('partition_graph.pkl', 'rb'))

In [9]:

print g.items()[90:100] 

[(2135, [2129, 2124, 2118, 2137]), (2136, [2138, 2139, 2141]), (2137, [2135, 2124, 2126]), (2138, [2141, 2136]), (2139, [2141, 2136]), (2140, [2337, 2335, 2134, 2142]), (2141, [2138, 2139, 2136]), (2142, [2343, 2337, 2140, 2164]), (2143, [2078, 2121, 2076, 2102, 2100, 2145]), (2144, [2149, 2147, 2146])]

Merging partitions¶

Let us recall that the covariance matrix of a $M\times N$ matrix $X$ can be computed via

$$ Cov(X) = \frac{1}{N}\sum^N_{i=1} (\boldsymbol{x_i}-\boldsymbol{\mu}) (\boldsymbol{x_i}-\boldsymbol{\mu})^T = \frac{1}{N} \sum^N_{i=1} \boldsymbol{x_i}\boldsymbol{x_i}^T - \boldsymbol{\mu}\boldsymbol{\mu}^T $$

where $\boldsymbol{x_i}, i = 1,...,N$ are the column vectors of matrix $X$ and

$$\boldsymbol{\mu}=\frac{1}{N}\sum^N_{i=1} \boldsymbol{x_i}$$

is the mean vector of all column vectors of matrix $X$.

Our goal now is to calculate the covariance matrix of a merged region given that we know the covariance matrices and mean vectors of the partitions to be merged.

Suppose we are given two matrices $X$ and $Y$ with dimension $M\times N_x$ and $M\times N_y$, respectively. Furthermore, we know their mean vectors $\boldsymbol{\mu_x}, \boldsymbol{\mu_y}$ , as well as their covariances $Cov(X)$ and $Cov(Y)$. Now, consider the concatenated matrix $Z = [X,Y]$ with dimension $M\times N_z$, where $N_z=N_x+N_y$. In our case, $Z$ would represent the measurements of the merged region.

The covariance of $Z$ can then be expressed as

$$ Cov(Z) = \frac{1}{N_z} \sum^{N_z}_{i=1} \boldsymbol{z_i}\boldsymbol{z_i}^T - \boldsymbol{\mu_z}\boldsymbol{\mu_z}^T = \frac{1}{N_x+N_y} [ \sum^{N_x}_{i=1} \boldsymbol{x_i}\boldsymbol{x_i}^T + \sum^{N_y}_{i=1} \boldsymbol{y_i}\boldsymbol{y_i}^T ] - \boldsymbol{\mu_z}\boldsymbol{\mu_z}^T $$$$ = \frac{1}{N_x+N_y} [ N_x(Cov(X)+\boldsymbol{\mu_x}\boldsymbol{\mu_x}^T) + Ny(Cov(Y)+\boldsymbol{\mu_y}\boldsymbol{\mu_y}^T)] - \boldsymbol{\mu_z}\boldsymbol{\mu_z}^T $$

where

$$ \boldsymbol{\mu_z}=\frac{N_x\boldsymbol{\mu_x}+N_y\boldsymbol{\mu_y}}{N_x+N_y} $$

is the mean vector of matrix $Z$.

In [10]:

dfCov = pickle.load(open('covTable.pkl', 'rb'))

Weather Analysis using MapReduce (Part 4)¶

4. Merging partitions using medium description length (MDL)¶

Find neighbors for each partition¶

Merging partitions¶

To be continued...¶