Example data has been downloaded from the open access Human Gene Expression Atlas and represents typical data bioinformaticians work with.
It is "Transcription profiling by array of brain in humans, chimpanzees and macaques, and brain, heart, kidney and liver in orangutans" experiment in a tab-separated format.
data = np.genfromtxt("data/ExpRawData-E-TABM-84-A-AFFY-44.tab",names=True,usecols=tuple(range(1,30)),dtype=float, delimiter="\t")
print len(data)
print len(data.dtype.names)
54674 29
data_array = data.view((np.float, len(data.dtype.names)))
data_array = data_array.transpose()
Let's have a look at how our data actually looks like
print data_array
[[ 6.3739767 5.986182 7.468118 ..., 11.745089 13.277803 13.067169 ] [ 6.4981704 4.861167 6.9479957 ..., 11.33983 13.031992 12.7244425] [ 6.271771 5.5666986 6.9435835 ..., 11.840397 13.332481 13.051369 ] ..., [ 6.112102 5.0324726 6.93343 ..., 11.411251 13.084589 12.768577 ] [ 6.359853 6.12955 8.460821 ..., 10.814936 12.888793 12.673793 ] [ 6.010906 5.4538217 6.8208113 ..., 11.467866 13.08952 12.792053 ]]
First, we'll implement the clustering using scipy modules
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
data_dist = pdist(data_array) # computing the distance
data_link = linkage(data_dist) # computing the linkage
dendrogram(data_link,labels=data.dtype.names)
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.suptitle('Samples clustering', fontweight='bold', fontsize=14);
Plotting a heatmap that many R users are used to is a tricky endevour in Python
# Compute and plot first dendrogram.
fig = plt.figure(figsize=(8,8))
# x ywidth height
ax1 = fig.add_axes([0.05,0.1,0.2,0.6])
Y = linkage(data_dist, method='single')
Z1 = dendrogram(Y, orientation='right',labels=data.dtype.names) # adding/removing the axes
ax1.set_xticks([])
# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Z2 = dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
#Compute and plot the heatmap
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = squareform(data_dist)
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
axmatrix.set_xticks([])
axmatrix.set_yticks([])
# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
plt.colorbar(im, cax=axcolor)
<matplotlib.colorbar.Colorbar instance at 0x109a023f8>
Learn more about the fastcluster module http://math.stanford.edu/~muellner/fastcluster.html?section=0
from fastcluster import *
%timeit data_link = linkage(data_array, method='single', metric='euclidean', preserve_input=True)
dendrogram(data_link,labels=data.dtype.names)
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.suptitle('Samples clustering', fontweight='bold', fontsize=14);
show()
10 loops, best of 3: 41.9 ms per loop
Biopython example would be rather specfic to bioinformatics. The output of the clustering should be viewed with TreeView program First, we need to read the data to a special bioipython.record object.
from Bio.Cluster import *
handle = open("../data/ExpRawData-E-TABM-84-A-AFFY-44.tab")
record = read(handle)
To fully appreciate this approach, we'll perform the clustering both by genes and by samples. Gene clustering will take a while in this example, subset the data to see the results faster. Fastcluster is also capable to perform gene clustering in a reasonable time.
genetree = record.treecluster(method='s')
genetree.scale()
exptree = record.treecluster(dist='u', transpose=1)
record.save("../results/biopython_clustering", genetree, exptree)