In this notebook we will consider the question of whether we can find different 'types' of cell towers based on their usage activity.
As part of this we will have a few things to figure out along the way.
We will touch on each of these considerations as we go. And, like any data science project, there are countless ways we could go about formulating the problem - below is an initial approach which could no doubt be iterated on and improved.
Imports and some settings...
import pandas as pd
import numpy as np
import glob
import hdbscan
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
import geojson
import matplotlib.colors as colors
import matplotlib.cm as cmx
from descartes import PolygonPatch
%matplotlib inline
# set path to where data and files are
MY_DIR = r'C:\Users\Andrew\github\mobile-phone-activity'
Pull in each csv file one by one and append them all into df_big, along with a smaller sample of the full data, df_small (for experimentation or quick development).
# get data
# get file paths
path = MY_DIR
files = glob.glob(path + r"\data\sms-call-internet-mi-2013-11-*")
# empty list to store individual df's into
list_ = []
# loop through files
for file_ in files:
    df_tmp = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df_tmp)
# concat all the df's into one
df_big = pd.concat(list_)
# make a small version of the full data for developing with
df_small = df_big.sample(n = 10000)
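As a small aside, the same file pattern could be built with os.path.join, which avoids hard-coding backslashes and keeps the path portable across operating systems (just a sketch, equivalent to the glob call above)...
# sketch: build the same glob pattern without hard-coded backslashes
import os
files_alt = glob.glob(os.path.join(MY_DIR, 'data', 'sms-call-internet-mi-2013-11-*'))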
First we will just rename some columns and print out some summary stats and info on our data...
# pick big or small df to work with
#df = df_small.copy()
df = df_big.copy()
# clean up some col names
df.rename(columns={'CellID':'cell_id'}, inplace=True)
# ensure cell_id is a string (for count uniques in describe())
df['cell_id'] = df['cell_id'].astype(str)
# ensure datetime is correct format
df['datetime'] = pd.to_datetime(df['datetime'])
print(df.shape)
df.sample(5)
Print out some info on the data...
# look at some info about df
df.info(verbose=True)
Print out summary stats for each column...
df.describe(include='all')
In this case we are safe to set missing data values to 0.
That is, we are assuming the absence of measurements for any given cell in a specific hour represents a lack of traffic as opposed to any issue with the data...
# get missing value percentages
print("% Missing Values")
print(df.isnull().sum()/len(df)*100)
...it looks from the above that there is indeed quite a bit of missing data here. Given the data is already aggregated up to cell and hour level, I'd probably be asking a few more questions about this. Is it really common enough for a cell to go a whole hour without any calls in or out?
Not much we can do about that here so we will take it at face value...
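If we did want to probe that assumption a little, a quick sketch like the one below (not part of the main flow here, and using 'callin' as just one example metric) would show whether the missing values concentrate in the overnight hours, which would support the 'no traffic' reading...
# sketch: % of rows with a missing callin value, by hour of day
null_by_hour = (df.assign(hour=df['datetime'].dt.hour)
                  .groupby('hour')['callin']
                  .apply(lambda s: s.isnull().mean() * 100))
print(null_by_hour.round(1))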
# fill all nulls with 0
df.fillna(value=0, inplace=True)
# look at a sample of data and keep track of the df shape as we do stuff
print(df.shape)
df.sample(5)
We will build some features to help us aggregate up from day and hour to "weekday" or "weekend" and "time of day" (night, day, or evening).
This will help us form a representation that should capture the ways we think the data might behave, while staying as compact as possible - we want to avoid feeding lots of features into the clustering and running into the curse of dimensionality.
This is one point where we are making a specific design decision; as always, it's one we could play around with and revisit as we iterate (e.g. why not stay at daily level, or slice the day into different time bins?).
# get if weekend or weekday from datetime
df['weekday'] = np.where((df['datetime'].dt.dayofweek < 5) , 'weekday', 'weekend')
# segment hours into times of day
def timeofday(hour):
    if hour <= 6:
        return 'night'
    elif hour <= 17:
        return 'day'
    elif hour <= 23:
        return 'evening'
df['timeofday'] = df['datetime'].dt.hour.apply(timeofday)
# make a single key field to describe day and time of traffic data
df['day_time'] = df['weekday'] + '_' + df['timeofday']
# drop some columns we no longer need
del df['datetime']
del df['countrycode']
del df['weekday']
del df['timeofday']
print(df.shape)
df.sample(5)
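As a quick sanity check (a small sketch, not strictly needed), we can confirm that day_time takes exactly the six values we expect...
# quick check: expect 6 day_time buckets (2 day types x 3 times of day)
print(sorted(df['day_time'].unique()))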
We will transform our data into percentages - the idea here is to help with interpretation, and it also acts as a way to normalize all our data onto a similar scale, which is key for the distance measures used in clustering.
Again, this is a design choice. Assuming we are really interested in the behaviour of the traffic in terms of its ebbs and flows, it's reasonable to focus on percentages. However, one trade-off is that any notion of the size of a cell, in terms of the volume of traffic it handles, is lost.
So we will first get the total values for each metric across the whole week and then use those to transform into percentages - for example, 'callin' will now become the % of all call-in traffic for that specific cell over the sample period that falls on that specific 'day_time' slice.
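For comparison, if we did want to keep hold of volume information, one alternative (just a sketch, not used in this notebook) would be to standardize the raw volumes - e.g. with sklearn's StandardScaler - and cluster on those instead of within-cell percentages...
from sklearn.preprocessing import StandardScaler

# sketch of the alternative: z-score each raw metric column so that volumes
# sit on a common scale (this would be applied to the wide, one-row-per-cell
# table we build further below - it is not run as part of this notebook)
def standardize_volumes(wide_features):
    return StandardScaler().fit_transform(wide_features)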
# get cell totals for the whole week
df_cell_totals = df.groupby(['cell_id'], as_index=False).sum()
# get overall volumes regardless of direction
df_cell_totals['calltotal'] = abs(df_cell_totals['callin']) + abs(df_cell_totals['callout'])
df_cell_totals['smstotal'] = abs(df_cell_totals['smsin']) + abs(df_cell_totals['smsout'])
df_cell_totals['internettotal'] = df_cell_totals['internet']
print(df_cell_totals.shape)
df_cell_totals.sample(5)
Now we will aggregate up the low level data to cell and day_time level - summing all metrics as we go...
# sum up to day_time, cell level
df = df.groupby(['day_time','cell_id'], as_index=False).sum()
print(df.shape)
df.sample(5)
Now we will create the df_pct dataframe, where each metric has been transformed into a percentage of that cell's total traffic for that metric.
# merge in cell totals
df_pct = pd.merge(df,df_cell_totals[['cell_id','internettotal','calltotal','smstotal']],how='left', on='cell_id')
# fill all nulls with 0
df_pct.fillna(value=0, inplace=True)
# convert measures to percentages of totals for that cell for the sample
df_pct['smsin'] = df_pct['smsin']/df_pct['smstotal']
df_pct['smsout'] = df_pct['smsout']/df_pct['smstotal']
df_pct['callin'] = df_pct['callin']/df_pct['calltotal']
df_pct['callout'] = df_pct['callout']/df_pct['calltotal']
df_pct['internet'] = df_pct['internet']/df_pct['internettotal']
# fill all nulls with 0
df_pct.fillna(value=0, inplace=True)
print(df_pct.shape)
df_pct.sample(5)
# summary stats
df_pct.describe(include='all')
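A quick sanity check on the transformation (a small sketch): for each cell, the callin percentages across its day_time slices should sum to roughly 1 (or 0 for cells that had no calls at all)...
# sanity check: callin percentages should sum to ~1 per cell (0 if no calls)
print(df_pct.groupby('cell_id')['callin'].sum().describe())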
As we are interested in clustering cell towers, we need to transform our data to be one row per cell. This is typically called moving from a long format to a wide format, whereby we pivot the data so each cell's rows become additional columns.
So really we want to make our data wider with groups of metrics for each day_time value for each cell...
# now pivot to go wide with metrics for each day_time and one row per cell
df_wide = df_pct[['cell_id',
'day_time',
'smsin',
'smsout',
'callin',
'callout',
'internet']].pivot(index='cell_id', columns='day_time')
# fill in nulls
df_wide.fillna(0, inplace=True)
print(df_wide.shape)
df_wide.sample(5)
...We now have a hierarchical column index after our pivot - to make things easier we will flatten this and rename the columns to be more meaningful...
# get new column names so we can flatten the nested index
new_col_names = []
# loop through each level and build up new col names
for c1 in df_wide.columns.levels[0]:
    for c2 in df_wide.columns.levels[1]:
        new_col_name = c1 + '_' + c2
        new_col_names.append(new_col_name)
#print(new_col_names)
df_wide.columns = df_wide.columns.droplevel()
df_wide.columns = new_col_names
print(df_wide.shape)
df_wide.sample(5)
...note: we now have 10,000 rows of data as hoped - one per cell...
df_wide.info(verbose=True)
df_wide.describe(include='all')
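As an aside on the column flattening above, the same names can be built in one line by joining each (metric, day_time) tuple directly - shown here on a toy MultiIndex so it doesn't clash with the already-flattened df_wide (just a sketch of the alternative)...
# sketch: flatten a MultiIndex by joining each column tuple directly
toy_cols = pd.MultiIndex.from_product([['smsin', 'callin'], ['weekday_day', 'weekend_night']])
print(['_'.join(col) for col in toy_cols])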
Now we are ready to perform our clustering and explore the results...
# make a new df to use from here on
df_final = df_wide.copy()
print(df_final.shape)
df_final.sample(5)
Here we will just start simple with a k-means approach. We have iterated on a few different choices of k and settled on a final value.
We could use something a bit fancier like HDBSCAN, which can fit more flexible cluster shapes and takes a more data-driven approach to the number of clusters. It could also be useful for picking out outlier cells - those for which no cluster was found to be close enough.
But to keep it simple for now we will stick with k-means...
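The iteration over k isn't shown here, but a minimal sketch of how we might compare a few candidate values using inertia (looking for an 'elbow' where the improvement flattens out) could look like this - the range and random_state are arbitrary choices...
# sketch: compare k-means inertia for a few candidate values of k
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(df_final)
    print(k, round(km.inertia_, 1))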
# define clustering approach
#clusterer = hdbscan.HDBSCAN()
clusterer = KMeans(n_clusters=5)
# do the clustering
clusterer.fit(df_final)
print(clusterer.labels_.max())
# add the cluster labels back onto the input data
df_final['cluster'] = clusterer.labels_
#df_final['cluster_prob'] = clusterer.probabilities_ # only relevant for HDBSCAN
print(df_final.shape)
df_final.sample(5)
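As an aside, if we did go down the HDBSCAN route, a minimal sketch might look like the below (min_cluster_size is an arbitrary, untuned choice) - cells labelled -1 get no cluster and could be treated as outliers...
# sketch: density-based alternative, run on the features only (drop the k-means labels)
clusterer_hdb = hdbscan.HDBSCAN(min_cluster_size=50)
labels_hdb = clusterer_hdb.fit_predict(df_final.drop(columns='cluster'))
print((labels_hdb == -1).sum(), 'cells flagged as noise')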
Let's look at the number of cells in each cluster to get a feel for how cells have been distributed across the clusters...
# look at the number and % of cells in each cluster
cluster_counts = df_final['cluster'].value_counts()
cluster_percents = cluster_counts / len(df_final) * 100
print(cluster_counts)
print(cluster_percents)
Now we will do some plots to help us explore and understand each cluster as well as get some feel for what the main differences between the clusters are...
# for each input metric plot a boxplot and violin plot to get a feel for how distributions differ by cluster
for var in df_final.columns[0:30]:
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 6))
    ax1.set_title(var)
    ax2.set_title(var)
    sns.boxplot(x="cluster", y=var, data=df_final, ax=ax1)
    sns.violinplot(x="cluster", y=var, data=df_final, ax=ax2)
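As a compact complement to the plots above (a sketch, and just one way to summarise things), we could also look at the mean profile of each cluster across all the metrics in a single heatmap...
# sketch: mean metric profile per cluster as a heatmap
cluster_profiles = df_final.groupby('cluster').mean()
plt.figure(figsize=(14, 4))
sns.heatmap(cluster_profiles, cmap='viridis')
plt.title('Mean metric profile by cluster')
plt.show()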