Idea: computing the distance from species-level units may work better (untested).
Raw reads: downloaded from JGI
HPC: /mnt/scratch/tg/qingpeng/2014JGI_Data/
HPC: /mnt/scratch/tg/qingpeng/2014JGI_Data_Work/
quality_trimming.qsub
*.qsub
Output: *.trimmed.fasta.gz
[qingpeng@dev-intel14 2014JGI_Data_Work]$ ls -lh *.fasta.gz
-rw-r--r-- 1 qingpeng tg 46G Apr 29 2014 iowa_corn.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 74G Apr 29 2014 iowa_prairie.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 66G Apr 29 2014 kansas_corn.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 145G Apr 30 2014 kansas_prairie.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 51G Apr 29 2014 wisconsin_corn.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 53G Apr 29 2014 wisconsin_prairie.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 11G Apr 29 2014 wisconsin_restored.trimmed.fasta.gz
-rw-r--r-- 1 qingpeng tg 13G Apr 29 2014 wisconsin_switchgrass.trimmed.fasta.gz
/mnt/scratch/tg/qingpeng/2014JGI_Data_Work/run/
[qingpeng@dev-intel14 run]$ ls load*.qsub
load_iowa_corn.qsub load_kansas_corn.qsub load_wisconsin_corn.qsub load_wisconsin_restored.qsub
load_iowa_prairie.qsub load_kansas_prairie.qsub load_wisconsin_prairie.qsub load_wisconsin_switchgrass.qsub
Output:
[qingpeng@dev-intel14 run]$ ls -lh *.ht
-rw-r--r-- 1 qingpeng tg 131G May 3 2014 iowa_corn.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 213G May 3 2014 iowa_prairie.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 191G May 3 2014 kansas_corn.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 418G May 4 2014 kansas_prairie.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 146G May 3 2014 wisconsin_corn.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 153G May 3 2014 wisconsin_prairie.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 131G May 3 2014 wisconsin_restored.trimmed.fasta.gz.ht
-rw-r--r-- 1 qingpeng tg 131G May 3 2014 wisconsin_switchgrass.trimmed.fasta.gz.ht
/mnt/scratch/tg/qingpeng/2014JGI_Data_Work/run/Count_Pair
wisconsin_restored-in-wisconsin_restored.qsub
wisconsin_restored-in-wisconsin_switchgrass.qsub
wisconsin_switchgrass-in-iowa_corn.qsub
wisconsin_switchgrass-in-iowa_prairie.qsub
wisconsin_switchgrass-in-kansas_corn.qsub
wisconsin_switchgrass-in-kansas_prairie.qsub
wisconsin_switchgrass-in-wisconsin_corn.qsub
wisconsin_switchgrass-in-wisconsin_prairie.qsub
wisconsin_switchgrass-in-wisconsin_restored.qsub
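The job names above appear to follow an `A-in-B` pattern over every ordered pair of the eight samples, self-pairs included, which would give 64 jobs. A minimal sketch of enumerating those names, assuming the sample list matches the trimmed-read listing; the helper `pair_job_names` is hypothetical and the contents of the qsub scripts are not reproduced here:

```python
from itertools import product

# Sample names taken from the *.trimmed.fasta.gz listing above.
SAMPLES = [
    "iowa_corn", "iowa_prairie", "kansas_corn", "kansas_prairie",
    "wisconsin_corn", "wisconsin_prairie", "wisconsin_restored",
    "wisconsin_switchgrass",
]

def pair_job_names(samples, suffix=".qsub"):
    """Return 'A-in-B<suffix>' for every ordered pair, self-pairs included."""
    return [f"{a}-in-{b}{suffix}" for a, b in product(samples, repeat=2)]

names = pair_job_names(SAMPLES)
print(len(names))   # number of pairwise jobs
print(names[:3])    # first few job names
```

Calling `pair_job_names(SAMPLES, suffix=".out")` gives the matching output file names.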
Output:
[qingpeng@dev-intel14 Count_Pair]$ ls *.out
iowa_corn-in-iowa_corn.out wisconsin_corn-in-iowa_corn.out
iowa_corn-in-iowa_prairie.out wisconsin_corn-in-iowa_prairie.out
iowa_corn-in-kansas_corn.out wisconsin_corn-in-kansas_corn.out
iowa_corn-in-kansas_prairie.out wisconsin_corn-in-kansas_prairie.out
iowa_corn-in-wisconsin_corn.out wisconsin_corn-in-wisconsin_corn.out
iowa_corn-in-wisconsin_prairie.out wisconsin_corn-in-wisconsin_prairie.out
iowa_corn-in-wisconsin_restored.out wisconsin_corn-in-wisconsin_restored.out
iowa_corn-in-wisconsin_switchgrass.out wisconsin_corn-in-wisconsin_switchgrass.out
iowa_prairie-in-iowa_corn.out wisconsin_prairie-in-iowa_corn.out
[qingpeng@dev-intel14 Count_Pair]$ ls *.out.qsub
iowa_prairie-in-wisconsin_corn.out.qsub kansas_prairie-in-kansas_corn.out.qsub
iowa_prairie-in-wisconsin_restored.out.qsub kansas_prairie-in-wisconsin_corn.out.qsub
Output:
GPGC_distance.txt
/mnt/scratch/tg/qingpeng/2014JGI_Data_Work/run/Count_Pair/Get_Table
rebuild_freq_table_stream.qsub
rebuild_freq_table_stream.py
Output:
[qingpeng@dev-intel14 Get_Table]$ ls *.freq_table
iowa_corn-in.list.freq_table kansas_prairie-in.list.freq_table wisconsin_restored-in.list.freq_table
iowa_prairie-in.list.freq_table wisconsin_corn-in.list.freq_table wisconsin_switchgrass-in.list.freq_table
kansas_corn-in.list.freq_table wisconsin_prairie-in.list.freq_table
[qingpeng@dev-intel14 Table]$ ls spectrum.out*
spectrum.out spectrum.out.IGS spectrum.out.IGS_abund
Get the IGS
count_spectrum.qsub
Generate the MAP.txt file: qsub seperate_IGS.py.qsub
get_dm.qsub
matrix.out (new results below)
import pandas as pd
import numpy as np
import scipy
from IPython.display import HTML, Image
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform
Typical IGS pipeline.
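GOS = pd.read_csv('../data/matrix_GPGC_IGS.out', delimiter=' ', header=None)
#label = pd.read_csv('../data/config-GOS.txt', delimiter=' ', header=None)
# Sample label = first column, truncated at the first '.'
label_list = []
for i in GOS[0]:
    label = i.split('.')[0]
    label_list.append(label)
len(label_list)
# Keep only the distance columns (column 0 holds the names).
# .ix has been removed from pandas; .loc slices the integer column labels inclusively.
GOS = GOS.loc[:, 1:len(label_list)]
GOS.columns = label_list
GOS.index = label_list
GOS.to_csv("../data/dm_GPGC_IGS.txt", sep="\t")
GOS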
| | iowa_corn-in | iowa_prairie-in | kansas_corn-in | kansas_prairie-in | wisconsin_corn-in | wisconsin_prairie-in | wisconsin_restored-in | wisconsin_switchgrass-in |
|---|---|---|---|---|---|---|---|---|
| iowa_corn-in | 0.000000 | 0.810425 | 1.000000 | 0.682664 | 0.802263 | 1.000000 | 0.998201 | 0.538422 |
| iowa_prairie-in | 0.810425 | 0.000000 | 1.000000 | 0.904476 | 0.855969 | 1.000000 | 0.999224 | 0.864518 |
| kansas_corn-in | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.952156 | 1.000000 | 1.000000 |
| kansas_prairie-in | 0.682664 | 0.904476 | 1.000000 | 0.000000 | 0.934775 | 1.000000 | 0.990498 | 0.841433 |
| wisconsin_corn-in | 0.802263 | 0.855969 | 1.000000 | 0.934775 | 0.000000 | 1.000000 | 0.999173 | 0.809836 |
| wisconsin_prairie-in | 1.000000 | 1.000000 | 0.952156 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 |
| wisconsin_restored-in | 0.998201 | 0.999224 | 1.000000 | 0.990498 | 0.999173 | 1.000000 | 0.000000 | 1.000000 |
| wisconsin_switchgrass-in | 0.538422 | 0.864518 | 1.000000 | 0.841433 | 0.809836 | 1.000000 | 1.000000 | 0.000000 |
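The loading and labelling steps above can also be written without a hard-coded column range. A sketch under assumptions: the file is space-delimited with the sample file name in the first field, and the two-row inline text is a stand-in for `matrix_GPGC_IGS.out` (its names and values are illustrative, not real results):

```python
import io
import pandas as pd

# Stand-in for the matrix file; names and values are illustrative only.
text = (
    "iowa_corn-in.list.freq_table 0.0 0.81\n"
    "iowa_prairie-in.list.freq_table 0.81 0.0\n"
)

# index_col=0 moves the name column into the index, leaving only distances.
dm = pd.read_csv(io.StringIO(text), sep=" ", header=None, index_col=0)

# Truncate each name at the first '.' to get the sample label.
labels = [name.split(".")[0] for name in dm.index]
dm.index = labels
dm.columns = labels
print(dm)
```

Using `index_col=0` means the number of samples never has to be written into the slicing step.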
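import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))
# Note: linkage() treats each row of GOS as an observation vector here,
# so merge heights are Euclidean distances between rows of the matrix.
R = dendrogram(linkage(GOS, method='average'), labels=label_list, leaf_font_size=20, orientation='left')
plt.ylabel('points')
plt.xlabel('Height')
plt.xlim(1, 1.6)
plt.suptitle('GPGC IGS: average', fontweight='bold', fontsize=20)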
dissimilarity = 1 - (number-of-reads-in-A-covered-by-B + number-of-reads-in-B-covered-by-A)/total-number-of-reads
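As a worked example of the formula above, a minimal sketch. The function name and the read counts are made up for illustration, and "total number of reads" is assumed to mean the reads in A plus the reads in B:

```python
def dissimilarity(a_covered_by_b, b_covered_by_a, total_a, total_b):
    """1 - (shared reads in both directions) / (all reads in A and B)."""
    total = total_a + total_b  # assumption: total = reads in A + reads in B
    if total == 0:
        raise ValueError("both samples are empty")
    return 1.0 - (a_covered_by_b + b_covered_by_a) / total

# Hypothetical counts: 200 of 400 reads in A are covered by B,
# and 300 of 600 reads in B are covered by A.
d = dissimilarity(200, 300, 400, 600)
print(d)  # 0.5
```

Two samples that cover each other completely give 0, and two samples with no shared reads give 1.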
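GOS = pd.read_csv('../data/GPGC_matrix.txt', delimiter=',', header=None)
#label = pd.read_csv('../data/config-GOS.txt', delimiter=' ', header=None)
# Sample label = first column of the comma-separated file
label_list = []
for i in GOS[0]:
    label = i.split(',')[0]
    label_list.append(label)
# Keep only the value columns (column 0 holds the names).
# .ix has been removed from pandas; .loc slices the integer column labels inclusively.
GOS = GOS.loc[:, 1:len(label_list)]
GOS.columns = label_list
GOS.index = label_list
# Convert the similarity fraction into a dissimilarity
GOS = 1 - GOS
GOS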
| | iowa_corn | iowa_prairie | kansas_corn | kansas_prairie | wisconsin_corn | wisconsin_prairie | wisconsin_restored | wisconsin_switchgrass |
|---|---|---|---|---|---|---|---|---|
| iowa_corn | 0.000000 | 0.551921 | 0.677156 | 0.719225 | 0.734101 | 0.615157 | 0.429304 | 0.499282 |
| iowa_prairie | 0.551921 | 0.000000 | 0.583938 | 0.646602 | 0.585341 | 0.438256 | 0.328167 | 0.373863 |
| kansas_corn | 0.677156 | 0.583938 | 0.000000 | 0.789139 | 0.767876 | 0.623198 | 0.377628 | 0.461251 |
| kansas_prairie | 0.719225 | 0.646602 | 0.789139 | 0.000000 | 0.798620 | 0.708182 | 0.410425 | 0.462230 |
| wisconsin_corn | 0.734101 | 0.585341 | 0.767876 | 0.798620 | 0.000000 | 0.683612 | 0.578405 | 0.661665 |
| wisconsin_prairie | 0.615157 | 0.438256 | 0.623198 | 0.708182 | 0.683612 | 0.000000 | 0.386761 | 0.420220 |
| wisconsin_restored | 0.429304 | 0.328167 | 0.377628 | 0.410425 | 0.578405 | 0.386761 | 0.000000 | 0.574357 |
| wisconsin_switchgrass | 0.499282 | 0.373863 | 0.461251 | 0.462230 | 0.661665 | 0.420220 | 0.574357 | 0.000000 |
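import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))
# Note: linkage() treats each row of GOS as an observation vector here,
# so merge heights are Euclidean distances between rows of the matrix.
R = dendrogram(linkage(GOS, method='average'), labels=label_list, leaf_font_size=20, orientation='left')
plt.ylabel('points')
plt.xlabel('Height')
#plt.xlim(1, 1.6)
plt.suptitle('GPGC: average', fontweight='bold', fontsize=20)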
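`linkage` interprets a two-dimensional input as one observation vector per row, so passing the square matrix directly clusters the rows by their Euclidean distance rather than by the tabulated dissimilarities. A hedged sketch of the alternative, converting the square matrix to condensed form with `squareform` first. The three-sample matrix is a rounded subset of the GPGC table above, used only for illustration:

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Rounded subset of the GPGC dissimilarity table (illustration only).
labels = ["iowa_corn", "iowa_prairie", "wisconsin_restored"]
dm = pd.DataFrame(
    [[0.000, 0.552, 0.429],
     [0.552, 0.000, 0.328],
     [0.429, 0.328, 0.000]],
    index=labels, columns=labels,
)

# squareform() checks symmetry and a zero diagonal, then returns the
# condensed (upper-triangle) vector that linkage() treats as distances.
condensed = squareform(dm.values, checks=True)
Z = linkage(condensed, method="average")

# no_plot=True returns the tree layout without needing a figure.
tree = dendrogram(Z, labels=labels, no_plot=True)
print(Z)
print(tree["ivl"])
```

With this input the merge heights in `Z` are on the same 0 to 1 scale as the table, so no `xlim` shift is needed when plotting.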