#!/usr/bin/env python
# coding: utf-8

# # Visualization with hierarchical clustering and t-SNE
# > A summary of the lecture "Unsupervised Learning with scikit-learn", via datacamp
#
# - toc: true
# - badges: true
# - comments: true
# - author: Chanseok Kang
# - categories: [Python, Datacamp, Machine Learning, Visualization]
# - image:

# In[1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# ## Visualizing hierarchies
# - Visualizations communicate insight
#     - 't-SNE': creates a 2D map of a dataset
#     - 'Hierarchical clustering'
# - A hierarchy of groups
#     - Groups of living things can form a hierarchy
#     - Clusters are contained in one another
# - Hierarchical clustering
#     - Every element begins in a separate cluster
#     - At each step, the two closest clusters are merged
#     - Continue until all elements are in a single cluster
#     - This bottom-up procedure is **"agglomerative"** hierarchical clustering (the top-down alternative is **"divisive"**)

# ### Hierarchical clustering of the grain data
# In the video, you learned that the SciPy ```linkage()``` function performs hierarchical clustering on an array of samples. Use the ```linkage()``` function to obtain a hierarchical clustering of the grain samples, and use ```dendrogram()``` to visualize the result. A sample of the grain measurements is provided in the array ```samples```, while the variety of each grain sample is given by the list ```varieties```.

# #### Preprocess

# In[2]:

df = pd.read_csv('./dataset/seeds.csv', header=None)
df[7] = df[7].map({1: 'Kama wheat', 2: 'Rosa wheat', 3: 'Canadian wheat'})
df.head()


# In[3]:

samples = df.iloc[:, :-1].values
varieties = df.iloc[:, -1].values


# In[4]:

from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, using varieties as labels
plt.figure(figsize=(15, 5))
dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=6);


# ### Hierarchies of stocks
# In chapter 1, you used k-means clustering to cluster companies according to their stock price movements. Now, you'll perform hierarchical clustering of the companies. You are given a NumPy array of price movements ```movements```, where the rows correspond to companies, and a list of the company names ```companies```. SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so you'll need to use the ```normalize()``` function from ```sklearn.preprocessing``` instead of ```Normalizer```.

# #### Preprocess

# In[5]:

df = pd.read_csv('./dataset/company-stock-movements-2010-2015-incl.csv', index_col=0)
df.head()


# In[6]:

movements = df.values
companies = df.index.values


# In[7]:

from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
normalized_movements = normalize(movements)

# Calculate the linkage: mergings
mergings = linkage(normalized_movements, method='complete')

# Plot the dendrogram
plt.figure(figsize=(15, 5))
dendrogram(mergings, labels=companies, leaf_rotation=90, leaf_font_size=6);


# ## Cluster labels in hierarchical clustering
# - Intermediate clusterings & height on the dendrogram
#     - Height on the dendrogram specifies the maximum distance between merging clusters
#     - Don't merge clusters further apart than this
# - Distance between clusters
#     - Defined by the "linkage method"
#     - In "complete" linkage, the distance between clusters is the maximum distance between their samples
#     - Different linkage method, different hierarchical clustering (see the short sketch below)
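# The linkage method determines the distance recorded at each merge, and therefore the shape of the dendrogram. As a minimal sketch (run on a small synthetic array, not part of the course exercises), the cell below compares the merge distances produced by ```'complete'``` and ```'single'``` linkage on the same data:

# In[ ]:

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
tiny = rng.normal(size=(6, 2))    # 6 samples with 2 features each

for method in ('complete', 'single'):
    merges = linkage(tiny, method=method)
    # Column 2 of the linkage matrix holds the distance at which each merge happened;
    # 'single' linkage typically merges at smaller distances than 'complete'.
    print(method, np.round(merges[:, 2], 2))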
# ### Different linkage, different hierarchical clustering!
# In the video, you saw a hierarchical clustering of the voting countries at the Eurovision song contest using ```'complete'``` linkage. Now, perform a hierarchical clustering of the voting countries with ```'single'``` linkage, and compare the resulting dendrogram with the one in the video. Different linkage, different hierarchical clustering!
#
# You are given an array ```samples```. Each row corresponds to a voting country, and each column corresponds to a performance that was voted for. The list ```country_names``` gives the name of each voting country. This dataset was obtained from [Eurovision](http://www.eurovision.tv/page/results).

# #### Preprocess

# In[8]:

df = pd.read_csv('./dataset/eurovision-2016.csv')
df


# In[9]:

samples = df.iloc[:, 2:7].values[:42]
country_names = df.iloc[:, 1].values[:42]


# In[10]:

# Calculate the linkage: mergings
mergings = linkage(samples, method='single')

# Plot the dendrogram
plt.figure(figsize=(15, 5))
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6);


# ### Extracting the cluster labels
# In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has 3 clusters. Now, use the ```fcluster()``` function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.

# #### Preprocess

# In[11]:

df = pd.read_csv('./dataset/seeds.csv', header=None)
df[7] = df[7].map({1: 'Kama wheat', 2: 'Rosa wheat', 3: 'Canadian wheat'})
df.head()


# In[12]:

samples = df.iloc[:, :-1].values
varieties = df.iloc[:, -1].values


# In[13]:

from scipy.cluster.hierarchy import fcluster

mergings = linkage(samples, method='complete')

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)


# ## t-SNE for 2-dimensional maps
# - t-SNE = "t-distributed stochastic neighbor embedding"
# - Maps samples to 2D space (or 3D)
# - The map approximately preserves the nearness of samples
# - Great for inspecting datasets

# ### t-SNE visualization of grain dataset
# In the video, you saw t-SNE applied to the iris dataset. In this exercise, you'll apply t-SNE to the grain samples data and inspect the resulting t-SNE features using a scatter plot.

# #### Preprocess

# In[14]:

df = pd.read_csv('./dataset/seeds.csv', header=None)
samples = df.iloc[:, :-1].values
variety_numbers = df.iloc[:, -1].values


# In[23]:

from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
xs = tsne_features[:, 0]

# Select the 1st feature: ys
ys = tsne_features[:, 1]

# Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c=variety_numbers);
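# Note that ```TSNE``` only provides ```fit_transform()``` (there is no separate ```transform()```), and its output is stochastic: the axes carry no interpretable units and the map changes from run to run. The sketch below (the ```init``` and ```random_state``` arguments are additions for this note, not part of the original cell) shows how to pin the embedding so the grain map is reproducible:

# In[ ]:

from sklearn.manifold import TSNE

# Fixing random_state makes repeated runs produce the same 2D embedding;
# learning_rate=200 matches the cell above, init='random' and the seed are arbitrary choices.
model = TSNE(learning_rate=200, init='random', random_state=42)
tsne_features = model.fit_transform(samples)
print(tsne_features.shape)    # (n_samples, 2): one 2D point per grain sample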
# ### A t-SNE map of the stock market
# t-SNE provides great visualizations when the individual samples can be labeled. In this exercise, you'll apply t-SNE to the company stock price data. A scatter plot of the resulting t-SNE features, labeled by the company names, gives you a map of the stock market! The stock price movements for each company are available as the array ```normalized_movements``` (these have already been normalized for you). The list ```companies``` gives the name of each company.

# #### Preprocess

# In[16]:

df = pd.read_csv('./dataset/company-stock-movements-2010-2015-incl.csv', index_col=0)
movements = df.values
companies = df.index.values
normalized_movements = normalize(movements)


# In[22]:

# Create a TSNE instance: model
model = TSNE(learning_rate=50)

# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
xs = tsne_features[:, 0]

# Select the 1st feature: ys
ys = tsne_features[:, 1]

# Scatter plot
plt.figure(figsize=(10, 10))
plt.scatter(xs, ys, alpha=0.5)

# Annotate the points with the company names
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=8, alpha=0.75)
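# The ```normalize()``` call rescales each company's row of price movements to unit L2 norm, so the map reflects the pattern of price movements rather than their absolute size. A quick sanity check of that property (a sketch added here, not part of the original exercise):

# In[ ]:

import numpy as np

# Every row of normalized_movements should have (approximately) unit L2 norm.
row_norms = np.linalg.norm(normalized_movements, axis=1)
print(np.allclose(row_norms, 1.0))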