There are two types of clustering methods.
Hierarchical clustering (also called hierarchical cluster analysis, or HCA) is a cluster-analysis method that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
Agglomerative: a "bottom-up" approach. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a "top-down" approach. All observations start in a single cluster, and splits are performed recursively as one moves down the hierarchy.
In general, merges and splits are chosen greedily. The results of hierarchical clustering are usually presented in a dendrogram.
Agglomerative methods are the most widely used; in the natural-sciences literature they are known as agglomerative hierarchical ascending classification methods (Sokal & Sneath 1963).
These methods require a similarity, dissimilarity, or distance index between individuals.
Several are available in the literature, depending on the type of variables and on the application. In our context we choose the canonical Euclidean distance.
Once groups are formed, a distance between groups must also be defined. It is called the aggregation criterion and gives its name to each specific method.
The simplest criteria are single linkage and complete linkage. The first defines the distance between two groups as the distance between their two closest individuals (one from each group); the second uses the two farthest individuals.
a) Find the smallest value in $D$, $d_{il}^0$, which defines the group $I_{il}^0$.
b) Compute the dissimilarity indices between $I_{il}^0$ and the remaining individuals.
c) Delete rows and columns $i$ and $l$ and add a row and column for $I_{il}^0$ to hold the new dissimilarities.
d) Return to step a) and repeat until a single group containing all $n$ individuals remains.
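The loop a) to d) can be sketched in a few lines of plain Python. This is only an illustrative, naive version: the function name `agglomerate` and the toy points are hypothetical, and instead of updating the matrix $D$ in place it recomputes group distances from the points at every step, which is simpler to read but slower.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def agglomerate(points, linkage=min):
    """Naive agglomerative clustering; returns the list of merges.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    # Every individual starts in its own group (a tuple of indices).
    groups = [(i,) for i in range(len(points))]
    merges = []

    def group_dist(pair):
        a, b = pair
        return linkage(dist(points[i], points[l]) for i in a for l in b)

    while len(groups) > 1:
        # a) find the pair of groups with the smallest dissimilarity
        a, b = min(combinations(groups, 2), key=group_dist)
        d = group_dist((a, b))
        # b)-c) remove the two merged groups and add the new one
        groups = [g for g in groups if g not in (a, b)] + [a + b]
        merges.append((a, b, d))
        # d) repeat until a single group remains
    return merges


# Hypothetical toy data: two tight pairs and one distant point.
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 10)]
merges = agglomerate(toy_points, linkage=min)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Passing `linkage=max` instead of `min` switches the same loop from single to complete linkage.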
Similarity measures evaluate the degree of resemblance or proximity between two elements.
Dissimilarity measures emphasize the degree of difference or remoteness between two elements. Higher values indicate greater difference or remoteness between the elements being compared.
A dissimilarity index can be associated with a similarity index through the following equation:
$$d(i,l) = s_{max} - s(i,l)$$
Distance measures
The distances between two individuals $i$ and $l$ are computed from their respective rows of the matrix $X$, whose columns are $p$ quantitative variables.
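As a small numerical illustration, assuming a made-up similarity matrix `S` and a made-up data matrix `X` (neither comes from the datasets used later), the conversion $d = s_{max} - s$ and the Euclidean distances between rows can be computed with NumPy like this:

```python
import numpy as np

# Hypothetical similarity matrix; s_max = 1 sits on the diagonal.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
D_from_S = S.max() - S  # d(i,l) = s_max - s(i,l)

# Hypothetical data matrix X: 3 individuals (rows), p = 2 quantitative variables.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
# Pairwise differences between rows, then the Euclidean norm of each difference.
diff = X[:, None, :] - X[None, :, :]
D_euclid = np.sqrt((diff ** 2).sum(axis=-1))

print(D_from_S)
print(D_euclid)
```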
Aggregation criteria: the choice of a similarity, dissimilarity, or distance between groups.
Two basic criteria are:
Single linkage: the distance between two groups A and B equals the distance between the two closest individuals from different groups:
$$d(A,B) = \min\{d(i,l) : i \in A,\ l \in B\}$$
This criterion tends to produce elongated groups (the chaining effect), which may include very different elements at their extremes.
Complete linkage: the distance between two groups A and B equals the distance between the two farthest individuals from different groups:
$$d(A,B) = \max\{d(i,l) : i \in A,\ l \in B\}$$
It tends to produce spherical groups.
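To make the two criteria concrete, here is a tiny sketch with two hypothetical groups of points on a line; the coordinates are invented purely for illustration.

```python
from math import dist

# Hypothetical groups of individuals in the plane.
A = [(0, 0), (1, 0)]
B = [(3, 0), (6, 0)]

# All distances d(i, l) with i in A and l in B.
pairwise = [dist(i, l) for i in A for l in B]

d_single = min(pairwise)    # closest pair across the two groups
d_complete = max(pairwise)  # farthest pair across the two groups
print(d_single, d_complete)
```

The same two groups are therefore "close" under single linkage and "far" under complete linkage, which is why the two criteria can produce quite different trees.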
To understand how a tree is built, we will use an example of points in a plane, with the Manhattan distance between points and complete linkage as the aggregation criterion, that is, as the distance between groups.
Steps
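A minimal sketch of such a construction with SciPy, assuming five hypothetical points in the plane (not the points of the original example). `linkage` is called with `metric="cityblock"`, SciPy's name for the Manhattan distance, and `method="complete"`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical points in the plane.
plane_points = np.array([[0, 0], [1, 0], [0, 2], [5, 5], [6, 5]], dtype=float)

# Each row of Z_plane describes one merge:
# [index of group 1, index of group 2, merge distance, size of new group]
Z_plane = linkage(plane_points, method="complete", metric="cityblock")
print(Z_plane)
```

Reading the third column from top to bottom gives the heights at which the tree's branches join.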
To obtain groups with minimum within-class inertia, one must use a Euclidean distance and, at each step of the procedure, merge the two groups that increase the within-class inertia the least. This is Ward's method (Ward 1963, Wishart 1969).
Suppose we have three groups A, B, and C at some stage of a hierarchical classification with Ward's method. We must decide which of the three pairs of groups to merge, that is, which merge causes the smallest increase in within-group inertia. The increase in inertia when merging A and B is $I_{AB} - I_A - I_B$. We call these increases Ward distances between groups and denote them by $W$; we must therefore compute $W(A,B)$, $W(A,C)$, and $W(B,C)$, and the smallest of them determines which groups to merge.
Let A and B be two non-empty, disjoint groups or classes, and let $p_A, p_B, g_A, g_B, I_A, I_B$ be the weights, centres of gravity, and inertias of groups A and B, respectively.
$$\text{Between-inertia}(A,B) = p_A d^2(g_A, g_{AB}) + p_B d^2(g_B, g_{AB}) = p_A \|g_A - g_{AB}\|^2 + p_B \|g_B - g_{AB}\|^2$$
Substituting $g_{AB} = \frac{1}{p_A + p_B}(p_A g_A + p_B g_B)$ gives
$$W(A,B) = \frac{p_A p_B}{p_A + p_B} d^2(g_A, g_B)$$
which is the increase in within-group inertia when merging groups A and B.
For any two individuals $i$ and $l$, $W$ is:
$$W(i,l) = \frac{p_i p_l}{p_i + p_l} d^2(i,l)$$
If all individuals have equal weights $1/n$, this becomes:
$$W(i,l) = \frac{1}{2n} d^2(i,l)$$
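The identity between the closed-form Ward distance and the increase in inertia $I_{AB} - I_A - I_B$ can be checked numerically. The sketch below assumes two small hypothetical groups with equal individual weights $1/n$; the helper `inertia` is written here only for this check.

```python
import numpy as np


def inertia(points, weights):
    """Weighted inertia of a group around its own centre of gravity."""
    g = np.average(points, axis=0, weights=weights)
    return float(np.sum(weights * np.sum((points - g) ** 2, axis=1)))


# Hypothetical groups; every individual has weight 1/n with n = 5.
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[5.0, 1.0], [6.0, 3.0], [7.0, 2.0]])
n = len(A) + len(B)
wA = np.full(len(A), 1 / n)
wB = np.full(len(B), 1 / n)

pA, pB = wA.sum(), wB.sum()                      # group weights
gA = np.average(A, axis=0, weights=wA)           # centres of gravity
gB = np.average(B, axis=0, weights=wB)

# Closed form: W(A,B) = pA*pB/(pA+pB) * d^2(gA, gB)
W_formula = pA * pB / (pA + pB) * np.sum((gA - gB) ** 2)

# Direct computation: I_AB - I_A - I_B
I_AB = inertia(np.vstack([A, B]), np.concatenate([wA, wB]))
W_direct = I_AB - inertia(A, wA) - inertia(B, wB)

print(W_formula, W_direct)
```

Both numbers agree up to floating-point error, which is what the derivation above predicts.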
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')
from sklearn import preprocessing
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from google.colab import drive
import os
drive.mount('/content/gdrive')
# Set the working directory on Drive
print(os.getcwd())
os.chdir("/content/gdrive/My Drive")
Mounted at /content/gdrive /content
%cd '/content/gdrive/MyDrive/Diplomado Python Análisis y Visualización de Datos/Modulo 5. Aprendizaje No Supervisado'
/content/gdrive/MyDrive/Diplomado Python Análisis y Visualización de Datos/Modulo 5. Aprendizaje No Supervisado
df = pd.read_csv('Mall_Customers.csv')
df.head()
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
---|---|---|---|---|---|
0 | 1 | Male | 19 | 15 | 39 |
1 | 2 | Male | 21 | 15 | 81 |
2 | 3 | Female | 20 | 16 | 6 |
3 | 4 | Female | 23 | 16 | 77 |
4 | 5 | Female | 31 | 17 | 40 |
df.isnull().sum()
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
df.describe()
| | CustomerID | Age | Annual Income (k$) | Spending Score (1-100) |
---|---|---|---|---|
count | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
mean | 100.500000 | 38.850000 | 60.560000 | 50.200000 |
std | 57.879185 | 13.969007 | 26.264721 | 25.823522 |
min | 1.000000 | 18.000000 | 15.000000 | 1.000000 |
25% | 50.750000 | 28.750000 | 41.500000 | 34.750000 |
50% | 100.500000 | 36.000000 | 61.500000 | 50.000000 |
75% | 150.250000 | 49.000000 | 78.000000 | 73.000000 |
max | 200.000000 | 70.000000 | 137.000000 | 99.000000 |
plt.figure(1, figsize=(15, 6))
n = 0
for x in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[x], bins=30, kde=True, stat='density')
    plt.title('Histogram of {}'.format(x))
plt.show()
label_encoder = preprocessing.LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
df.head()
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
---|---|---|---|---|---|
0 | 1 | 1 | 19 | 15 | 39 |
1 | 2 | 1 | 21 | 15 | 81 |
2 | 3 | 0 | 20 | 16 | 6 |
3 | 4 | 0 | 23 | 16 | 77 |
4 | 5 | 0 | 31 | 17 | 40 |
df.shape
(200, 5)
plt.figure(1, figsize = (16 ,8))
sns.heatmap(df[['Age','Annual Income (k$)','Spending Score (1-100)']])
plt.show()
plt.figure(figsize=(12, 6))
plt.subplot(121)
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=df['Age'], y=df['Annual Income (k$)'])
plt.subplots_adjust(hspace=0.3, wspace=0.3)
plt.subplot(122)
sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'])
plt.show()
Dendrograms help show the progression as groups are merged.
A dendrogram is a branching diagram that shows how each group is composed by branching into its child nodes.
It depicts the merging of groups as we work through the distance matrix.
df=df.drop(columns=['CustomerID','Gender'])
df.head()
| | Age | Annual Income (k$) | Spending Score (1-100) |
---|---|---|---|
0 | 19 | 15 | 39 |
1 | 21 | 15 | 81 |
2 | 20 | 16 | 6 |
3 | 23 | 16 | 77 |
4 | 31 | 17 | 40 |
plt.figure(1, figsize=(16, 8))
# Ward's method requires the Euclidean metric
dendrogram = sch.dendrogram(
    sch.linkage(df, method="ward", metric="euclidean"),
    orientation="top",  # options: right, left, bottom, top
)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Important attributes
- linkage: the aggregation criterion.
  - ward: minimizes the variance of the clusters being merged.
  - average: uses the average of the distances of each observation of the two sets.
  - complete (or 'maximum' linkage): uses the maximum distances between all observations of the two sets.
  - single: uses the minimum of the distances between all observations of the two sets.
- affinity: the metric used to compute the linkage. It can be euclidean, l1, l2, manhattan, cosine, or precomputed. If linkage is ward, only euclidean is accepted. If precomputed is chosen, a distance matrix is required as input to fit the method. In recent scikit-learn releases this parameter is named metric.
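As a quick way to see the linkage option in action, the sketch below fits AgglomerativeClustering with each criterion on two synthetic, well-separated blobs (hypothetical data generated here, not the customer dataset). Only `n_clusters` and `linkage` are passed, so the default Euclidean metric is used and the call does not depend on whether the installed version names the metric parameter `affinity` or `metric`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical data: two compact, well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X_toy = np.vstack([blob_a, blob_b])

labels_by_linkage = {}
for link in ["ward", "average", "complete", "single"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=link)
    labels_by_linkage[link] = model.fit_predict(X_toy)
    # Cluster sizes found with this criterion
    print(link, np.bincount(labels_by_linkage[link]))
```

On data this clean every criterion recovers the same two groups; the differences between criteria only show up on elongated, noisy, or overlapping clusters.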
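# Ward linkage always uses the Euclidean metric; the 'affinity' argument was renamed 'metric' in recent scikit-learn, so it is omitted here
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')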
y_hc = hc.fit_predict(df)
y_hc
array([4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 0, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 1, 2, 1, 2, 1, 2, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2])
y_hc.shape
(200,)
df['cluster'] = pd.DataFrame(y_hc)
df
| | Age | Annual Income (k$) | Spending Score (1-100) | cluster |
---|---|---|---|---|
0 | 19 | 15 | 39 | 4 |
1 | 21 | 15 | 81 | 3 |
2 | 20 | 16 | 6 | 4 |
3 | 23 | 16 | 77 | 3 |
4 | 31 | 17 | 40 | 4 |
... | ... | ... | ... | ... |
195 | 35 | 120 | 79 | 2 |
196 | 45 | 126 | 28 | 1 |
197 | 32 | 126 | 74 | 2 |
198 | 32 | 137 | 18 | 1 |
199 | 30 | 137 | 83 | 2 |
200 rows × 4 columns
trace1 = go.Scatter3d(
    x=df['Age'],
    y=df['Spending Score (1-100)'],
    z=df['Annual Income (k$)'],
    mode='markers',
    marker=dict(
        color=df['cluster'],
        size=10,
        line=dict(color=df['cluster'], width=12),
        opacity=0.8,
    ),
)
data = [trace1]
layout = go.Layout(
    title='Clusters using Agglomerative Clustering',
    scene=dict(
        xaxis=dict(title='Age'),
        yaxis=dict(title='Spending Score'),
        zaxis=dict(title='Annual Income'),
    ),
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
df.head()
| | Age | Annual Income (k$) | Spending Score (1-100) | cluster |
---|---|---|---|---|
0 | 19 | 15 | 39 | 4 |
1 | 21 | 15 | 81 | 3 |
2 | 20 | 16 | 6 | 4 |
3 | 23 | 16 | 77 | 3 |
4 | 31 | 17 | 40 | 4 |
X = df.iloc[:, [1,2]].values
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label ='Cluster 5')
plt.title('Customer clusters (Hierarchical Clustering)')
plt.xlabel('Annual Income(k$)')
plt.ylabel('Spending Score(1-100)')
plt.show()
X = df.iloc[:, [0,1]].values
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label ='Cluster 5')
plt.title('Customer clusters (Hierarchical Clustering)')
plt.xlabel('Age')
plt.ylabel('Annual Income (k$)')
plt.show()
X = df.iloc[:, [0,2]].values  # Age and Spending Score (df has only 4 columns, so index 4 is out of range)
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label ='Cluster 5')
plt.title('Customer clusters (Hierarchical Clustering)')
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.show()
- method: how to compute the proximity between clusters
- metric: the distance metric
- optimal_ordering: the ordering of the points
Imagine that an automobile manufacturer has developed prototypes for a new vehicle. Before introducing the new model into its range, the manufacturer wants to determine which existing vehicles on the market are most similar to the prototypes, that is, how the vehicles can be grouped, which group is most similar to the model, and therefore which models they will be competing against.
Our goal here is to use clustering methods to find the most distinctive groups of vehicles. This will summarize the existing vehicles and help the manufacturer make decisions about new models in a simple way.
To download the data, we will use !wget. The data is hosted on IBM Object Storage.
Did you know? When it comes to machine learning, you will likely be working with large datasets. As a business, where can you host your data? IBM offers a unique opportunity for businesses, with 10 TB of IBM Cloud Object Storage.
!wget -O cars_clus.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv
--2021-09-21 02:53:27-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196 Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 17774 (17K) [text/csv] Saving to: ‘cars_clus.csv’ cars_clus.csv 100%[===================>] 17.36K --.-KB/s in 0.01s 2021-09-21 02:53:27 (1.42 MB/s) - ‘cars_clus.csv’ saved [17774/17774]
filename = 'cars_clus.csv'
# Read the data
pdf = pd.read_csv(filename)
print ("Shape: ", pdf.shape)
pdf.head(5)
Shape: (159, 16)
| | manufact | model | sales | resale | type | price | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg | lnsales | partition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | 0.000 | 21.500 | 1.800 | 140.000 | 101.200 | 67.300 | 172.400 | 2.639 | 13.200 | 28.000 | 2.828 | 0.0 |
1 | Acura | TL | 39.384 | 19.875 | 0.000 | 28.400 | 3.200 | 225.000 | 108.100 | 70.300 | 192.900 | 3.517 | 17.200 | 25.000 | 3.673 | 0.0 |
2 | Acura | CL | 14.114 | 18.225 | 0.000 | $null$ | 3.200 | 225.000 | 106.900 | 70.600 | 192.000 | 3.470 | 17.200 | 26.000 | 2.647 | 0.0 |
3 | Acura | RL | 8.588 | 29.725 | 0.000 | 42.000 | 3.500 | 210.000 | 114.600 | 71.400 | 196.600 | 3.850 | 18.000 | 22.000 | 2.150 | 0.0 |
4 | Audi | A4 | 20.397 | 22.255 | 0.000 | 23.990 | 1.800 | 150.000 | 102.600 | 68.200 | 178.000 | 2.998 | 16.400 | 27.000 | 3.015 | 0.0 |
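The features include sales, resale, type, price, engine_s, horsepow, wheelbas, width, length, curb_wgt, fuel_cap, mpg, and lnsales.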
print("Shape before cleaning: ", pdf.shape)
numeric_cols = ['sales', 'resale', 'type', 'price', 'engine_s',
                'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
                'mpg', 'lnsales']
# Convert to numeric; entries that cannot be parsed (such as '$null$') become NaN
pdf[numeric_cols] = pdf[numeric_cols].apply(pd.to_numeric, errors='coerce')
# Drop the rows with missing values and renumber the index
pdf = pdf.dropna()
pdf = pdf.reset_index(drop=True)
print("Shape after cleaning: ", pdf.shape)
pdf.head(5)
Shape before cleaning: (159, 16)
Shape after cleaning: (117, 16)
| | manufact | model | sales | resale | type | price | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg | lnsales | partition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | 0.0 | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 2.828 | 0.0 |
1 | Acura | TL | 39.384 | 19.875 | 0.0 | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 3.673 | 0.0 |
2 | Acura | RL | 8.588 | 29.725 | 0.0 | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 2.150 | 0.0 |
3 | Audi | A4 | 20.397 | 22.255 | 0.0 | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 3.015 | 0.0 |
4 | Audi | A6 | 18.780 | 23.555 | 0.0 | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 2.933 | 0.0 |
featureset = pdf[['engine_s', 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]
featureset
| | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg |
---|---|---|---|---|---|---|---|---|
0 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 |
1 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 |
2 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 |
3 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 |
4 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
112 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.767 | 14.5 | 26.0 |
113 | 2.0 | 115.0 | 98.9 | 68.3 | 172.3 | 2.853 | 14.5 | 26.0 |
114 | 1.8 | 150.0 | 106.4 | 68.5 | 184.1 | 3.043 | 16.4 | 27.0 |
115 | 2.0 | 115.0 | 97.4 | 66.7 | 160.4 | 3.079 | 13.7 | 26.0 |
116 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.762 | 14.6 | 26.0 |
117 rows × 8 columns
In this part we use the SciPy package to cluster the dataset. First, we compute the distance matrix.
feature_mtx=featureset.values
feature_mtx
array([[ 1.8 , 140. , 101.2 , 67.3 , 172.4 , 2.639, 13.2 , 28. ], [ 3.2 , 225. , 108.1 , 70.3 , 192.9 , 3.517, 17.2 , 25. ], [ 3.5 , 210. , 114.6 , 71.4 , 196.6 , 3.85 , 18. , 22. ], [ 1.8 , 150. , 102.6 , 68.2 , 178. , 2.998, 16.4 , 27. ], [ 2.8 , 200. , 108.7 , 76.1 , 192. , 3.561, 18.5 , 22. ], [ 4.2 , 310. , 113. , 74. , 198.2 , 3.902, 23.7 , 21. ], [ 2.8 , 193. , 107.3 , 68.5 , 176. , 3.197, 16.6 , 24. ], [ 2.8 , 193. , 111.4 , 70.9 , 188. , 3.472, 18.5 , 24.8 ], [ 3.1 , 175. , 109. , 72.7 , 194.6 , 3.368, 17.5 , 25. ], [ 3.8 , 240. , 109. , 72.7 , 196.2 , 3.543, 17.5 , 23. ], [ 3.8 , 205. , 113.8 , 74.7 , 206.8 , 3.778, 18.5 , 24. ], [ 3.8 , 205. , 112.2 , 73.5 , 200. , 3.591, 17.5 , 25. ], [ 4.6 , 275. , 115.3 , 74.5 , 207.2 , 3.978, 18.5 , 22. ], [ 4.6 , 275. , 108. , 75.5 , 200.6 , 3.843, 19. , 22. ], [ 3. , 200. , 107.4 , 70.3 , 194.8 , 3.77 , 18. , 22. ], [ 2.2 , 115. , 104.1 , 67.9 , 180.9 , 2.676, 14.3 , 27. ], [ 3.1 , 170. , 107. , 69.4 , 190.4 , 3.051, 15. , 25. ], [ 3.1 , 175. , 107.5 , 72.5 , 200.9 , 3.33 , 16.6 , 25. ], [ 3.4 , 180. , 110.5 , 72.7 , 197.9 , 3.34 , 17. , 27. ], [ 3.8 , 200. , 101.1 , 74.1 , 193.2 , 3.5 , 16.8 , 25. ], [ 5.7 , 345. , 104.5 , 73.6 , 179.7 , 3.21 , 19.1 , 22. ], [ 1.8 , 120. , 97.1 , 66.7 , 174.3 , 2.398, 13.2 , 33. ], [ 1. , 55. , 93.1 , 62.6 , 149.4 , 1.895, 10.3 , 45. ], [ 2.5 , 163. , 103.7 , 69.7 , 190.9 , 2.967, 15.9 , 24. ], [ 2.5 , 168. , 106. , 69.2 , 193. , 3.332, 16. , 24. ], [ 2.7 , 200. , 113. , 74.4 , 209.1 , 3.452, 17. , 26. ], [ 2. , 132. , 108. , 71. , 186. , 2.911, 16. , 27. ], [ 3.5 , 253. , 113. , 74.4 , 207.7 , 3.564, 17. , 23. ], [ 2. , 132. , 105. , 74.4 , 174.4 , 2.567, 12.5 , 29. ], [ 2.5 , 163. , 103.7 , 69.1 , 190.2 , 2.879, 15.9 , 24. ], [ 2.5 , 168. , 108. , 71. , 186. , 3.058, 16. , 24. ], [ 8. , 450. , 96.2 , 75.7 , 176.7 , 3.375, 19. , 16. ], [ 5.2 , 230. , 138.7 , 79.3 , 224.2 , 4.47 , 26. , 17. ], [ 3.9 , 175. , 109.6 , 78.8 , 192.6 , 4.245, 32. , 15. ], [ 3.9 , 175. 
, 127.2 , 78.8 , 208.5 , 4.298, 32. , 16. ], [ 2.5 , 120. , 131. , 71.5 , 215. , 3.557, 22. , 19. ], [ 2.4 , 150. , 113.3 , 76.8 , 186.3 , 3.533, 20. , 24. ], [ 2. , 110. , 98.4 , 67. , 174.7 , 2.468, 12.7 , 30. ], [ 3.8 , 190. , 101.3 , 73.1 , 183.2 , 3.203, 15.7 , 24. ], [ 2.5 , 170. , 106.5 , 69.1 , 184.6 , 2.769, 15. , 25. ], [ 3. , 155. , 108.5 , 73. , 197.6 , 3.368, 16. , 24. ], [ 4.6 , 200. , 114.7 , 78.2 , 212. , 3.908, 19. , 21. ], [ 4. , 210. , 111.6 , 70.2 , 190.7 , 3.876, 21. , 19. ], [ 3. , 150. , 120.7 , 76.6 , 200.9 , 3.761, 26. , 21. ], [ 4.6 , 240. , 119. , 78.7 , 204.6 , 4.808, 26. , 16. ], [ 2.5 , 119. , 117.5 , 69.4 , 200.7 , 3.086, 20. , 23. ], [ 4.6 , 220. , 138.5 , 79.1 , 224.5 , 4.241, 25.1 , 18. ], [ 1.6 , 106. , 103.2 , 67.1 , 175.1 , 2.339, 11.9 , 32. ], [ 2.3 , 135. , 106.9 , 70.3 , 188.8 , 2.932, 17.1 , 27. ], [ 2. , 146. , 103.2 , 68.9 , 177.6 , 3.219, 15.3 , 24. ], [ 3.2 , 205. , 106.4 , 70.4 , 178.2 , 3.857, 21.1 , 19. ], [ 3.5 , 210. , 118.1 , 75.6 , 201.2 , 4.288, 20. , 23. ], [ 1.5 , 92. , 96.1 , 65.7 , 166.7 , 2.24 , 11.9 , 31. ], [ 2. , 140. , 100.4 , 66.9 , 174. , 2.626, 14.5 , 27. ], [ 2.4 , 148. , 106.3 , 71.6 , 185.4 , 3.072, 17.2 , 25. ], [ 3. , 227. , 108.3 , 70.2 , 193.7 , 3.342, 18.5 , 25. ], [ 2.5 , 120. , 93.4 , 66.7 , 152. , 3.045, 19. , 17. ], [ 4. , 190. , 101.4 , 69.4 , 167.5 , 3.194, 20. , 20. ], [ 4. , 195. , 105.9 , 72.3 , 181.5 , 3.88 , 20.5 , 19. ], [ 3. , 210. , 105.1 , 70.5 , 190.2 , 3.373, 18.5 , 23. ], [ 3. , 225. , 110.2 , 70.9 , 189.2 , 3.638, 19.8 , 23. ], [ 4. , 290. , 112.2 , 72. , 196.7 , 3.89 , 22.5 , 22. ], [ 4.6 , 275. , 109. , 73.6 , 208.5 , 3.868, 20. , 22. ], [ 4.6 , 215. , 117.7 , 78.2 , 215.3 , 4.121, 19. , 21. ], [ 1.8 , 113. , 98.4 , 66.5 , 173.6 , 2.25 , 13.2 , 30. ], [ 2.4 , 154. , 100.8 , 68.9 , 175.4 , 2.91 , 15.9 , 24. ], [ 2.4 , 145. , 103.7 , 68.5 , 187.8 , 2.945, 16.3 , 25. ], [ 3.5 , 210. , 107.1 , 70.3 , 194.1 , 3.443, 19. , 22. ], [ 3. , 161. 
, 97.2 , 72.4 , 180.3 , 3.131, 19.8 , 21. ], [ 3.5 , 200. , 107.3 , 69.9 , 186.6 , 4.52 , 24.3 , 18. ], [ 3. , 173. , 107.3 , 66.7 , 178.3 , 3.51 , 19.5 , 20. ], [ 2. , 125. , 106.5 , 69.1 , 184.8 , 2.769, 15. , 28. ], [ 2. , 125. , 106.4 , 69.6 , 185. , 2.892, 16. , 30. ], [ 3. , 153. , 108.5 , 73. , 199.7 , 3.379, 16. , 24. ], [ 4.6 , 200. , 114.7 , 78.2 , 212. , 3.958, 19. , 21. ], [ 4. , 210. , 111.6 , 70.2 , 190.1 , 3.876, 21. , 18. ], [ 3.3 , 170. , 112.2 , 74.9 , 194.7 , 3.944, 20. , 21. ], [ 2.3 , 185. , 105.9 , 67.7 , 177.4 , 3.25 , 16.4 , 26. ], [ 3.2 , 221. , 111.5 , 70.8 , 189.4 , 3.823, 21.1 , 25. ], [ 4.3 , 275. , 121.5 , 73.1 , 203.1 , 4.133, 23.2 , 21. ], [ 5. , 302. , 99. , 71.3 , 177.1 , 4.125, 21.1 , 20. ], [ 1.8 , 126. , 99.8 , 67.3 , 177.5 , 2.593, 13.2 , 30. ], [ 2.4 , 155. , 103.1 , 69.1 , 183.5 , 3.012, 15.9 , 25. ], [ 3. , 222. , 108.3 , 70.3 , 190.5 , 3.294, 18.5 , 25. ], [ 3.3 , 170. , 112.2 , 74.9 , 194.8 , 3.991, 20. , 21. ], [ 3.3 , 170. , 106.3 , 71.7 , 182.6 , 3.947, 21. , 19. ], [ 3.1 , 150. , 107. , 69.4 , 192. , 3.102, 15.2 , 25. ], [ 4. , 250. , 113.8 , 74.4 , 205.4 , 3.967, 18.5 , 22. ], [ 4.3 , 190. , 107. , 67.8 , 181.2 , 4.068, 17.5 , 19. ], [ 3.4 , 185. , 120. , 72.2 , 201.4 , 3.948, 25. , 22. ], [ 2. , 132. , 105. , 74.4 , 174.4 , 2.559, 12.5 , 29. ], [ 2. , 132. , 108. , 71. , 186.3 , 2.942, 16. , 27. ], [ 2.4 , 150. , 113.3 , 76.8 , 186.3 , 3.528, 20. , 24. ], [ 2.4 , 150. , 104.1 , 68.4 , 181.9 , 2.906, 15. , 27. ], [ 3.4 , 175. , 107. , 70.4 , 186.3 , 3.091, 15.2 , 25. ], [ 3.8 , 200. , 101.1 , 74.5 , 193.4 , 3.492, 16.8 , 25. ], [ 3.8 , 195. , 110.5 , 72.7 , 196.5 , 3.396, 18. , 25. ], [ 3.8 , 205. , 112.2 , 72.6 , 202.5 , 3.59 , 17.5 , 24. ], [ 2.7 , 217. , 95.2 , 70.1 , 171. , 2.778, 17. , 22. ], [ 3.4 , 300. , 92.6 , 69.5 , 174.5 , 3.032, 17. , 21. ], [ 3.4 , 300. , 92.6 , 69.5 , 174.5 , 3.075, 17. , 23. ], [ 1.9 , 100. , 102.4 , 66.4 , 176.9 , 2.332, 12.1 , 33. ], [ 1.9 , 100. , 102.4 , 66.4 , 180. 
, 2.367, 12.1 , 33. ], [ 1.9 , 124. , 102.4 , 66.4 , 176.9 , 2.452, 12.1 , 31. ], [ 1.8 , 120. , 97. , 66.7 , 174. , 2.42 , 13.2 , 33. ], [ 2.2 , 133. , 105.2 , 70.1 , 188.5 , 2.998, 18.5 , 27. ], [ 3. , 210. , 107.1 , 71.7 , 191.9 , 3.417, 18.5 , 26. ], [ 1.8 , 140. , 102.4 , 68.3 , 170.5 , 2.425, 14.5 , 31. ], [ 2.4 , 142. , 103.3 , 66.5 , 178.7 , 2.58 , 15.1 , 23. ], [ 2. , 127. , 94.9 , 66.7 , 163.8 , 2.668, 15.3 , 27. ], [ 2.7 , 150. , 105.3 , 66.5 , 183.3 , 3.44 , 18.5 , 23. ], [ 4.7 , 230. , 112.2 , 76.4 , 192.5 , 5.115, 25.4 , 15. ], [ 2. , 115. , 98.9 , 68.3 , 163.3 , 2.767, 14.5 , 26. ], [ 2. , 115. , 98.9 , 68.3 , 172.3 , 2.853, 14.5 , 26. ], [ 1.8 , 150. , 106.4 , 68.5 , 184.1 , 3.043, 16.4 , 27. ], [ 2. , 115. , 97.4 , 66.7 , 160.4 , 3.079, 13.7 , 26. ], [ 2. , 115. , 98.9 , 68.3 , 163.3 , 2.762, 14.6 , 26. ]])
feature_mtx.shape
(117, 8)
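from scipy.spatial import distance
leng = feature_mtx.shape[0]
# scipy.zeros is not available in recent SciPy releases; use NumPy instead
D = np.zeros([leng, leng])
# Fill the matrix with the Euclidean distance between every pair of rows
for i in range(leng):
    for j in range(leng):
        D[i, j] = distance.euclidean(feature_mtx[i], feature_mtx[j])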
D.shape
(117, 117)
D
array([[ 0. , 87.9180919 , 75.79845989, ..., 16.63650252, 28.07638866, 26.83496095], [ 87.9180919 , 0. , 17.0877409 , ..., 75.60022934, 115.3194773 , 114.3440861 ], [ 75.79845989, 17.0877409 , 0. , ..., 62.15304698, 103.39586278, 102.0832197 ], ..., [ 16.63650252, 75.60022934, 62.15304698, ..., 0. , 43.35044747, 41.45224917], [ 28.07638866, 115.3194773 , 103.39586278, ..., 43.35044747, 0. , 3.75905427], [ 26.83496095, 114.3440861 , 102.0832197 , ..., 41.45224917, 3.75905427, 0. ]])
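The same kind of matrix can be obtained without explicit loops. The sketch below uses a small random array as a hypothetical stand-in for the feature matrix and checks that `pdist` followed by `squareform` reproduces the double loop.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for the feature matrix: 6 individuals, 3 variables.
rng = np.random.default_rng(0)
features = rng.random((6, 3))

condensed = pdist(features, metric="euclidean")  # vector of length n(n-1)/2
D_square = squareform(condensed)                 # symmetric n x n matrix

# Reference: explicit double loop over all pairs of rows.
D_loop = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        D_loop[i, j] = np.linalg.norm(features[i] - features[j])

print(np.allclose(D_square, D_loop))
```

The condensed vector is also the form in which SciPy's hierarchical clustering routines expect precomputed distances.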
In agglomerative clustering, at each iteration the algorithm must update the distance matrix to reflect the distance between the newly formed cluster and the remaining clusters in the forest.
SciPy supports the following methods for computing the distance between the newly formed cluster and each of the others:
- single
- complete
- average
- weighted
- centroid
We use complete in our case, but feel free to change it and see how the results change.
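One simple way to compare the methods is to run each of them on the same distances and look at the height of the final merge. The sketch below does this on a small random array (a hypothetical stand-in for the feature matrix), passing SciPy a condensed Euclidean distance vector built with `pdist`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the feature matrix: 10 individuals, 4 variables.
rng = np.random.default_rng(42)
sample = rng.random((10, 4))
condensed = pdist(sample)  # Euclidean by default

final_heights = {}
for method in ["single", "complete", "average", "weighted", "centroid"]:
    Z_m = linkage(condensed, method=method)
    final_heights[method] = Z_m[-1, 2]  # height of the last merge
    print(f"{method:>9}: {final_heights[method]:.3f}")
```

Single linkage always closes the tree at a height no greater than complete linkage, since it measures the closest pair rather than the farthest one.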
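import pylab
import scipy.cluster.hierarchy
# Note: linkage expects either the raw observations or a condensed distance
# matrix (for example scipy.spatial.distance.squareform(D)); given the square
# matrix D, SciPy treats each row as an observation vector.
Z = scipy.cluster.hierarchy.linkage(D, 'complete')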
Z
array([[3.60000000e+01, 9.20000000e+01, 7.15323980e-03, 2.00000000e+00], [2.80000000e+01, 9.00000000e+01, 1.14125439e-02, 2.00000000e+00], [4.10000000e+01, 7.40000000e+01, 7.12297939e-02, 2.00000000e+00], [1.12000000e+02, 1.16000000e+02, 1.64885417e-01, 2.00000000e+00], [7.60000000e+01, 8.40000000e+01, 4.36406293e-01, 2.00000000e+00], [2.60000000e+01, 9.10000000e+01, 1.23189811e+00, 2.00000000e+00], [1.90000000e+01, 9.50000000e+01, 1.31306447e+00, 2.00000000e+00], [2.10000000e+01, 1.04000000e+02, 1.38862974e+00, 2.00000000e+00], [9.90000000e+01, 1.00000000e+02, 2.94422151e+00, 2.00000000e+00], [2.30000000e+01, 2.90000000e+01, 3.36006180e+00, 2.00000000e+00], [4.20000000e+01, 7.50000000e+01, 3.77893264e+00, 2.00000000e+00], [7.10000000e+01, 7.20000000e+01, 4.00362228e+00, 2.00000000e+00], [0.00000000e+00, 5.30000000e+01, 8.52846641e+00, 2.00000000e+00], [1.01000000e+02, 1.02000000e+02, 9.50746399e+00, 2.00000000e+00], [1.10000000e+02, 1.14000000e+02, 1.08028907e+01, 2.00000000e+00], [6.70000000e+01, 1.06000000e+02, 1.12146560e+01, 2.00000000e+00], [4.00000000e+00, 1.40000000e+01, 1.14975491e+01, 2.00000000e+00], [1.10000000e+01, 9.70000000e+01, 1.15510839e+01, 2.00000000e+00], [1.20000000e+01, 6.20000000e+01, 1.20315870e+01, 2.00000000e+00], [7.80000000e+01, 8.30000000e+01, 1.22451683e+01, 2.00000000e+00], [1.00000000e+00, 6.00000000e+01, 1.23384003e+01, 2.00000000e+00], [9.30000000e+01, 1.31000000e+02, 1.50518952e+01, 3.00000000e+00], [1.05000000e+02, 1.22000000e+02, 1.55116196e+01, 3.00000000e+00], [3.00000000e+01, 3.90000000e+01, 1.82131708e+01, 2.00000000e+00], [1.60000000e+01, 2.40000000e+01, 1.82260617e+01, 2.00000000e+00], [3.80000000e+01, 8.80000000e+01, 1.85711210e+01, 2.00000000e+00], [5.90000000e+01, 1.32000000e+02, 1.85919980e+01, 3.00000000e+00], [1.07000000e+02, 1.29000000e+02, 1.90007757e+01, 3.00000000e+00], [1.15000000e+02, 1.20000000e+02, 1.99923984e+01, 3.00000000e+00], [4.00000000e+01, 7.30000000e+01, 2.00108220e+01, 2.00000000e+00], 
[8.10000000e+01, 1.03000000e+02, 2.05792970e+01, 2.00000000e+00], [1.27000000e+02, 1.43000000e+02, 2.27889851e+01, 5.00000000e+00], [7.90000000e+01, 1.35000000e+02, 2.28694261e+01, 3.00000000e+00], [5.40000000e+01, 1.38000000e+02, 2.31406852e+01, 4.00000000e+00], [6.40000000e+01, 1.13000000e+02, 2.39614987e+01, 2.00000000e+00], [8.00000000e+01, 1.25000000e+02, 2.47504218e+01, 3.00000000e+00], [1.23000000e+02, 1.33000000e+02, 2.52052456e+01, 4.00000000e+00], [5.50000000e+01, 1.37000000e+02, 2.56583890e+01, 3.00000000e+00], [8.00000000e+00, 1.70000000e+01, 2.79871287e+01, 2.00000000e+00], [8.60000000e+01, 1.17000000e+02, 2.94217051e+01, 3.00000000e+00], [2.50000000e+01, 1.19000000e+02, 2.94392141e+01, 3.00000000e+00], [8.50000000e+01, 1.40000000e+02, 3.01296762e+01, 3.00000000e+00], [1.30000000e+01, 1.49000000e+02, 3.23304217e+01, 4.00000000e+00], [4.80000000e+01, 1.39000000e+02, 3.30523018e+01, 4.00000000e+00], [2.00000000e+00, 5.10000000e+01, 3.36196852e+01, 2.00000000e+00], [2.70000000e+01, 8.70000000e+01, 3.50006201e+01, 2.00000000e+00], [3.00000000e+00, 4.90000000e+01, 3.60698237e+01, 2.00000000e+00], [6.00000000e+00, 1.42000000e+02, 3.77471322e+01, 3.00000000e+00], [6.90000000e+01, 1.53000000e+02, 3.82515187e+01, 5.00000000e+00], [1.00000000e+01, 1.34000000e+02, 3.86593095e+01, 3.00000000e+00], [7.00000000e+00, 5.80000000e+01, 3.89679992e+01, 2.00000000e+00], [3.70000000e+01, 4.70000000e+01, 3.97217277e+01, 2.00000000e+00], [1.41000000e+02, 1.58000000e+02, 4.02221159e+01, 5.00000000e+00], [1.50000000e+01, 1.24000000e+02, 4.19953062e+01, 3.00000000e+00], [6.50000000e+01, 8.20000000e+01, 4.35855112e+01, 2.00000000e+00], [6.60000000e+01, 1.50000000e+02, 4.50039145e+01, 5.00000000e+00], [7.00000000e+01, 9.40000000e+01, 4.55005025e+01, 2.00000000e+00], [1.28000000e+02, 1.47000000e+02, 4.95386226e+01, 4.00000000e+00], [1.21000000e+02, 1.55000000e+02, 5.38695790e+01, 4.00000000e+00], [1.08000000e+02, 1.44000000e+02, 5.71562000e+01, 4.00000000e+00], [1.63000000e+02, 
1.72000000e+02, 5.77658979e+01, 7.00000000e+00], [9.00000000e+00, 4.40000000e+01, 5.91698283e+01, 2.00000000e+00], [1.45000000e+02, 1.51000000e+02, 6.03837592e+01, 5.00000000e+00], [6.80000000e+01, 1.26000000e+02, 6.04112506e+01, 3.00000000e+00], [1.64000000e+02, 1.67000000e+02, 6.04814058e+01, 5.00000000e+00], [1.61000000e+02, 1.66000000e+02, 6.05308171e+01, 5.00000000e+00], [9.60000000e+01, 1.65000000e+02, 6.05862147e+01, 6.00000000e+00], [1.36000000e+02, 1.54000000e+02, 6.14147340e+01, 5.00000000e+00], [1.56000000e+02, 1.71000000e+02, 6.50522978e+01, 5.00000000e+00], [1.69000000e+02, 1.73000000e+02, 6.55089578e+01, 7.00000000e+00], [1.80000000e+01, 3.30000000e+01, 6.81767329e+01, 2.00000000e+00], [5.70000000e+01, 7.70000000e+01, 7.51732588e+01, 2.00000000e+00], [1.18000000e+02, 1.60000000e+02, 7.85878375e+01, 6.00000000e+00], [3.20000000e+01, 4.60000000e+01, 7.92986870e+01, 2.00000000e+00], [1.48000000e+02, 1.82000000e+02, 8.15119120e+01, 1.00000000e+01], [4.30000000e+01, 1.46000000e+02, 8.44417218e+01, 3.00000000e+00], [1.68000000e+02, 1.79000000e+02, 8.77561179e+01, 7.00000000e+00], [1.09000000e+02, 1.70000000e+02, 8.81280429e+01, 4.00000000e+00], [1.75000000e+02, 1.87000000e+02, 9.30459850e+01, 6.00000000e+00], [1.77000000e+02, 1.85000000e+02, 9.66864949e+01, 1.20000000e+01], [1.74000000e+02, 1.94000000e+02, 1.04818186e+02, 8.00000000e+00], [1.11000000e+02, 1.84000000e+02, 1.06553419e+02, 6.00000000e+00], [1.76000000e+02, 1.89000000e+02, 1.07164506e+02, 1.00000000e+01], [5.00000000e+01, 1.83000000e+02, 1.07589482e+02, 7.00000000e+00], [8.90000000e+01, 1.81000000e+02, 1.09240248e+02, 6.00000000e+00], [5.60000000e+01, 1.93000000e+02, 1.09607978e+02, 8.00000000e+00], [5.20000000e+01, 1.30000000e+02, 1.14496651e+02, 3.00000000e+00], [1.57000000e+02, 1.91000000e+02, 1.15015150e+02, 1.30000000e+01], [5.00000000e+00, 1.52000000e+02, 1.17446027e+02, 4.00000000e+00], [6.30000000e+01, 1.98000000e+02, 1.24112665e+02, 7.00000000e+00], [3.50000000e+01, 4.50000000e+01, 
1.24267882e+02, 2.00000000e+00], [1.88000000e+02, 2.01000000e+02, 1.31988781e+02, 8.00000000e+00], [1.86000000e+02, 1.95000000e+02, 1.38267377e+02, 1.30000000e+01], [1.80000000e+02, 1.92000000e+02, 1.42225875e+02, 6.00000000e+00], [6.10000000e+01, 1.59000000e+02, 1.54773144e+02, 5.00000000e+00], [2.00000000e+02, 2.04000000e+02, 1.55967351e+02, 2.00000000e+01], [1.62000000e+02, 1.78000000e+02, 1.56069020e+02, 4.00000000e+00], [9.80000000e+01, 2.06000000e+02, 1.58875971e+02, 8.00000000e+00], [1.96000000e+02, 2.10000000e+02, 1.85849610e+02, 1.80000000e+01], [3.40000000e+01, 2.08000000e+02, 1.93383914e+02, 9.00000000e+00], [1.97000000e+02, 2.02000000e+02, 2.08133233e+02, 1.60000000e+01], [1.90000000e+02, 2.13000000e+02, 2.30571072e+02, 6.00000000e+00], [2.09000000e+02, 2.16000000e+02, 2.49917484e+02, 2.20000000e+01], [2.07000000e+02, 2.17000000e+02, 2.52297061e+02, 1.80000000e+01], [1.99000000e+02, 2.15000000e+02, 3.12353684e+02, 2.80000000e+01], [2.12000000e+02, 2.14000000e+02, 3.26867599e+02, 2.80000000e+01], [2.05000000e+02, 2.11000000e+02, 3.64191387e+02, 9.00000000e+00], [2.03000000e+02, 2.20000000e+02, 3.65288634e+02, 2.10000000e+01], [2.19000000e+02, 2.21000000e+02, 5.40817151e+02, 5.00000000e+01], [2.18000000e+02, 2.22000000e+02, 5.53975777e+02, 3.40000000e+01], [2.00000000e+01, 2.23000000e+02, 7.32101455e+02, 1.00000000e+01], [2.20000000e+01, 2.24000000e+02, 7.98352012e+02, 2.20000000e+01], [2.25000000e+02, 2.26000000e+02, 9.04829030e+02, 8.40000000e+01], [2.28000000e+02, 2.29000000e+02, 1.17807448e+03, 1.06000000e+02], [2.27000000e+02, 2.30000000e+02, 1.57835872e+03, 1.16000000e+02], [3.10000000e+01, 2.31000000e+02, 2.58674263e+03, 1.17000000e+02]])
Hierarchical clustering does not require the number of clusters to be fixed in advance. In some applications, however, we want a partition into disjoint clusters, as in flat clustering.
In that case we can cut the dendrogram at a chosen distance:
from scipy.cluster.hierarchy import fcluster
max_d = 3
clusters = fcluster(Z, max_d, criterion='distance')
clusters
array([ 51, 102, 95, 59, 83, 3, 46, 47, 38, 80, 99, 97, 4, 7, 84, 20, 30, 39, 40, 82, 9, 19, 29, 70, 31, 89, 56, 78, 55, 71, 32, 108, 76, 41, 50, 13, 66, 22, 44, 33, 73, 88, 90, 75, 81, 14, 77, 23, 58, 60, 87, 96, 12, 52, 64, 104, 28, 42, 48, 94, 103, 8, 5, 106, 26, 68, 65, 92, 72, 85, 35, 15, 16, 74, 88, 91, 37, 43, 100, 6, 2, 17, 69, 101, 37, 34, 67, 79, 45, 49, 55, 56, 66, 63, 36, 82, 86, 98, 107, 1, 1, 10, 11, 18, 19, 57, 93, 53, 54, 21, 61, 105, 24, 27, 62, 25, 24], dtype=int32)
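In the output above, the cut at max_d = 3 produces labels up to 108 for 117 vehicles, presumably because most merges in Z happen at much larger distances. As a rough illustration of how the cut height controls the number of flat clusters, the sketch below uses a small made-up dataset; the array `X`, the name `Z_toy` and the three thresholds are assumptions for the example, not the vehicle data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight groups on a line, far apart from each other.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Complete linkage: the distance between two groups is the distance
# between their two farthest members.
Z_toy = linkage(X, method="complete")

# A very low cut keeps every point separate; higher cuts merge more points.
labels_low = fcluster(Z_toy, t=0.1, criterion="distance")
labels_mid = fcluster(Z_toy, t=2.0, criterion="distance")
labels_high = fcluster(Z_toy, t=20.0, criterion="distance")

for name, labels in [("0.1", labels_low), ("2.0", labels_mid), ("20.0", labels_high)]:
    print("cut at", name, "->", len(set(labels)), "clusters")
```

Each group merges completely at height 1.0 and the two groups merge at height 11.0, so only a cut between those two values recovers the two groups.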
# Determine the flat clusters by fixing the maximum number of clusters instead of a cut distance
from scipy.cluster.hierarchy import fcluster
k = 5
clusters = fcluster(Z, k, criterion='maxclust')
clusters
array([3, 4, 4, 3, 4, 1, 3, 3, 3, 4, 4, 4, 1, 1, 4, 2, 3, 3, 3, 4, 1, 2, 2, 3, 3, 4, 3, 4, 3, 3, 3, 5, 4, 3, 3, 2, 3, 2, 3, 3, 3, 4, 4, 3, 4, 2, 4, 2, 3, 3, 4, 4, 2, 3, 3, 4, 2, 3, 3, 4, 4, 1, 1, 4, 2, 3, 3, 4, 3, 4, 3, 2, 2, 3, 4, 4, 3, 3, 4, 1, 1, 2, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 1, 1, 2, 2, 2, 2, 3, 4, 3, 3, 2, 3, 4, 2, 2, 3, 2, 2], dtype=int32)
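The `maxclust` criterion can be read as choosing the cut height automatically. The sketch below, on random data that is only an assumption for the example, checks that asking for k clusters gives the same partition as cutting halfway between the two merge heights that separate k clusters from k - 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 30 random points in the plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Z_toy = linkage(X, method="complete")

k = 4
labels_k = fcluster(Z_toy, k, criterion="maxclust")

# Column 2 of the linkage matrix holds the merge heights in ascending order.
heights = Z_toy[:, 2]

# Undoing the last k - 1 merges leaves k clusters, so cut between
# the k-th and the (k - 1)-th merge counted from the top.
threshold = (heights[-k] + heights[-(k - 1)]) / 2
labels_t = fcluster(Z_toy, threshold, criterion="distance")

# Compare the two partitions through their co-membership matrices,
# which do not depend on how the clusters are numbered.
same_k = labels_k[:, None] == labels_k[None, :]
same_t = labels_t[:, None] == labels_t[None, :]
print(len(set(labels_k)), len(set(labels_t)), bool((same_k == same_t).all()))
```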
fig = pylab.figure(figsize=(18,50))
# Build each leaf label from the manufacturer, the model and the vehicle type.
def llf(id):
    return '[%s %s %s]' % (pdf['manufact'][id], pdf['model'][id], int(float(pdf['type'][id])) )
dendro = scipy.cluster.hierarchy.dendrogram(Z, leaf_label_func=llf, leaf_rotation=0, leaf_font_size =12, orientation = 'top')
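To see what `leaf_label_func` receives, the following sketch builds a tiny dendrogram without drawing it (`no_plot=True`) and inspects the labels it would show. The names list, the data and the helper `label_leaf` are hypothetical; the function is called once per leaf with the row index of the original observation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical observations and their names, in row order.
names = ["alpha", "beta", "gamma", "delta"]
X = np.array([[0.0], [1.0], [5.0], [6.0]])
Z_toy = linkage(X, method="complete")

def label_leaf(idx):
    # idx is the row index of the observation placed at this leaf.
    return "[%s]" % names[idx]

# no_plot=True computes the layout and the labels without needing matplotlib.
info = dendrogram(Z_toy, leaf_label_func=label_leaf, no_plot=True)
print(info["ivl"])     # leaf labels in left-to-right order
print(info["leaves"])  # the corresponding row indices
```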
from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(feature_mtx,feature_mtx)
print(dist_matrix)
[[  0.          87.9180919   75.79845989 ...  16.63650252  28.07638866  26.83496095]
 [ 87.9180919    0.          17.0877409  ...  75.60022934 115.3194773  114.3440861 ]
 [ 75.79845989  17.0877409    0.         ...  62.15304698 103.39586278 102.0832197 ]
 ...
 [ 16.63650252  75.60022934  62.15304698 ...   0.          43.35044747  41.45224917]
 [ 28.07638866 115.3194773  103.39586278 ...  43.35044747   0.           3.75905427]
 [ 26.83496095 114.3440861  102.0832197  ...  41.45224917   3.75905427   0.        ]]
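The printed matrix is symmetric with zeros on the diagonal, as expected for Euclidean distances. A minimal check on three made-up points (the array `X` is an assumption for the example) shows this, and also how the square matrix relates to the condensed form returned by `pdist`:

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.spatial.distance import pdist, squareform

# Hypothetical points chosen so that the distances are easy to verify by hand.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Square form: D[i, j] is the Euclidean distance between rows i and j.
D = distance_matrix(X, X)

# Condensed form: only the upper triangle, row by row.
condensed = pdist(X, metric="euclidean")

print(D)
print(condensed)
print(np.allclose(squareform(condensed), D))
```

Here the distance between the first two points is 5, between the first and the third is 10, and between the last two is 5.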
from sklearn.cluster import AgglomerativeClustering
agglom = AgglomerativeClustering(n_clusters = 6, linkage = 'complete')
agglom.fit(feature_mtx)
agglom.labels_
array([2, 0, 0, 3, 0, 1, 0, 0, 3, 0, 0, 0, 1, 1, 0, 2, 3, 3, 3, 0, 1, 2, 4, 3, 3, 0, 2, 0, 2, 3, 3, 5, 0, 3, 0, 3, 3, 2, 0, 3, 3, 0, 0, 3, 0, 3, 0, 2, 2, 2, 0, 0, 2, 2, 3, 0, 2, 0, 0, 0, 0, 1, 1, 0, 2, 3, 3, 0, 3, 0, 3, 2, 2, 3, 0, 0, 3, 0, 0, 1, 1, 2, 3, 0, 3, 3, 3, 0, 0, 0, 2, 2, 3, 3, 3, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 0, 2, 2, 2, 3, 0, 2, 2, 3, 2, 2])
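Cluster labels from scikit-learn start at 0, while those from SciPy's `fcluster` start at 1, and the numbering of the clusters is arbitrary in both. The sketch below, on synthetic blobs that are only an assumption for the example, compares the two libraries with the adjusted Rand index, which ignores the numbering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: three well-separated groups of 20 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
true_labels = np.repeat([0, 1, 2], 20)

# scikit-learn: complete linkage with a fixed number of clusters.
sk_labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)

# SciPy: same linkage, then cut the tree into at most 3 clusters.
sp_labels = fcluster(linkage(X, method="complete"), 3, criterion="maxclust")

# The adjusted Rand index is 1.0 when two partitions are identical
# up to a renaming of the labels.
ari = adjusted_rand_score(sk_labels, sp_labels)
print(ari, adjusted_rand_score(true_labels, sk_labels))
```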
pdf['cluster_'] = agglom.labels_
pdf.head()
 | manufact | model | sales | resale | type | price | engine_s | horsepow | wheelbas | width | length | curb_wgt | fuel_cap | mpg | lnsales | partition | cluster_ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | 0.0 | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 2.828 | 0.0 | 2 |
1 | Acura | TL | 39.384 | 19.875 | 0.0 | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 3.673 | 0.0 | 0 |
2 | Acura | RL | 8.588 | 29.725 | 0.0 | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 2.150 | 0.0 | 0 |
3 | Audi | A4 | 20.397 | 22.255 | 0.0 | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 3.015 | 0.0 | 3 |
4 | Audi | A6 | 18.780 | 23.555 | 0.0 | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 2.933 | 0.0 | 0 |
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0, 1, n_clusters))
cluster_labels = list(range(0, n_clusters))
# Create a figure 16 inches wide by 14 inches tall.
plt.figure(figsize=(16,14))
for color, label in zip(colors, cluster_labels):
    subset = pdf[pdf.cluster_ == label]
    # Annotate every point with its model name.
    for i in subset.index:
        plt.text(subset.horsepow[i], subset.mpg[i],str(subset['model'][i]), rotation=25)
    # Pass the single RGBA value through color= so it is not mistaken for values to be colormapped.
    plt.scatter(subset.horsepow, subset.mpg, s= subset.price*10, color=color, label='cluster'+str(label),alpha=0.5)
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
[Figure: scatter plot titled 'Clusters' of mpg against horsepow, one color per cluster, marker size proportional to price, each point labeled with its model name.]
As you can see, the scatter plot shows the distribution of each cluster, but it is not very clear where the centroid of each cluster lies. Moreover, there are two types of vehicles in our dataset: "truck" (value 1 in the type column) and "car" (value 0 in the type column). We therefore use the type to distinguish the classes and summarize each cluster. First we count the number of cases in each group:
pdf.groupby(['cluster_','type'])['cluster_'].count()
cluster_  type
0         0.0     29
          1.0     14
1         0.0     10
2         0.0     26
          1.0      4
3         0.0     21
          1.0     11
4         0.0      1
5         0.0      1
Name: cluster_, dtype: int64
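The same counts can be laid out as a cluster-by-type table, which makes empty combinations explicit. This is a small sketch on a made-up frame; the name `toy` and its values are assumptions, only the column names follow the ones used here.

```python
import pandas as pd

# Hypothetical labels: three clusters, type 0.0 for cars and 1.0 for trucks.
toy = pd.DataFrame({
    "cluster_": [0, 0, 0, 1, 1, 2],
    "type": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# One count per (cluster, type) pair that actually occurs.
counts = toy.groupby(["cluster_", "type"]).size()

# Pivot the type level into columns; missing pairs become 0.
table = counts.unstack(fill_value=0)
print(table)
```

In this toy frame cluster 2 contains no trucks, so its entry in the 1.0 column is 0 rather than being absent.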
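# Mean of the selected columns per cluster and vehicle type (select several columns with a list).
agg_cars = pdf.groupby(['cluster_','type'])[['horsepow','engine_s','mpg','price']].mean()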
agg_cars
cluster_ | type | horsepow | engine_s | mpg | price |
---|---|---|---|---|---|
0 | 0.0 | 210.551724 | 3.420690 | 23.648276 | 30.449310 |
0 | 1.0 | 206.428571 | 4.064286 | 18.500000 | 28.727714 |
1 | 0.0 | 294.700000 | 4.380000 | 21.600000 | 57.864000 |
2 | 0.0 | 121.230769 | 1.934615 | 29.115385 | 14.720385 |
2 | 1.0 | 133.750000 | 2.225000 | 22.750000 | 15.856500 |
3 | 0.0 | 160.857143 | 2.680952 | 24.857143 | 19.822048 |
3 | 1.0 | 154.272727 | 2.936364 | 20.909091 | 21.199364 |
4 | 0.0 | 55.000000 | 1.000000 | 45.000000 | 9.235000 |
5 | 0.0 | 450.000000 | 8.000000 | 16.000000 | 69.725000 |
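Each row of the table above is the mean over the vehicles that share a cluster and a type. A minimal sketch on made-up numbers (the frame `toy` and its values are assumptions) shows the grouping and how a single cell can be read back from the resulting two-level index:

```python
import pandas as pd

# Hypothetical vehicles with a cluster label, a type and two features.
toy = pd.DataFrame({
    "cluster_": [0, 0, 1, 1],
    "type": [0.0, 0.0, 0.0, 1.0],
    "horsepow": [100.0, 200.0, 300.0, 150.0],
    "mpg": [30.0, 20.0, 15.0, 18.0],
})

# Select the feature columns with a list, then average within each group.
means = toy.groupby(["cluster_", "type"])[["horsepow", "mpg"]].mean()
print(means)

# A (cluster, type) tuple addresses one row of the two-level index.
print(means.loc[(0, 0.0), "horsepow"])
```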
There are three main clusters, which contain the majority of the vehicles.
Cars:
Trucks:
Note that we did not use the type or the price of the cars in the clustering process, yet hierarchical clustering was able to form the clusters and separate them with fairly high accuracy.
plt.figure(figsize=(16,10))
for color, label in zip(colors, cluster_labels):
    # Rows of agg_cars for this cluster, indexed by vehicle type.
    subset = agg_cars.loc[(label,),]
    for i in subset.index:
        plt.text(subset.loc[i, 'horsepow']+5, subset.loc[i, 'mpg'], 'type='+str(int(i)) + ', price='+str(int(subset.loc[i, 'price']))+'k')
    # Pass the single RGBA value through color= so it is not mistaken for values to be colormapped.
    plt.scatter(subset.horsepow, subset.mpg, s=subset.price*20, color=color, label='cluster'+str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
[Figure: scatter plot titled 'Clusters' of mean mpg against mean horsepow for each cluster and vehicle type, marker size proportional to mean price, each point annotated with its type and price.]