The pca
tool can be used to implement a number of dimensionality reduction methods on SNP data (PCA, t-SNE, UMAP) and to filter and/or impute missing data in genotype matrices to reduce the effects of missing data.
# conda install ipyrad -c conda-forge -c bioconda
# conda install ipcoal -c conda-forge
# conda install scikit-learn -c conda-forge
import ipyrad.analysis as ipa
import toyplot
print(ipa.__version__)
print(toyplot.__version__)
0.9.61 0.19.0
# the simulated SNP database file
SNPS = "/tmp/oaks.snps.hdf5"
# download example hdf5 dataset (158Mb, takes ~2-3 minutes)
URL = "https://www.dropbox.com/s/x6a4i47xqum27fo/virentes_ref.snps.hdf5?raw=1"
ipa.download(url=URL, path=SNPS);
file already exists
IMAP = {
"virg": ["LALC2", "TXWV2", "FLBA140", "FLSF33", "SCCU3"],
"mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
"gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
"bran": ["BJSL25", "BJSB3", "BJVL19"],
"fusi": ["MXED8", "MXGT4", "TXMD3", "TXGR3"],
"sagr": ["CUCA4", "CUSV6", "CUVN10"],
"oleo": ["MXSA3017", "BZBB1", "HNDA09", "CRL0030", "CRL0001"],
}
MINMAP = {
"virg": 3,
"mini": 3,
"gemi": 3,
"bran": 2,
"fusi": 2,
"sagr": 2,
"oleo": 3,
}
tool = ipa.pca(data=SNPS, minmaf=0.05, imap=IMAP, minmap=MINMAP, impute_method="sample")
Samples: 26 Sites before filtering: 1182005 Filtered (indels): 0 Filtered (bi-allel): 26249 Filtered (mincov): 142749 Filtered (minmap): 876036 Filtered (subsample invariant): 600226 Filtered (minor allele frequency): 494278 Filtered (combined): 1068892 Sites after filtering: 74034 Sites containing missing values: 61935 (83.66%) Missing values in SNP matrix: 140810 (7.32%) Imputation: 'sampled'; (0, 1, 2) = 61.2%, 17.0%, 21.8%
Unlinked SNPs are automatically sampled from each locus. By setting nreplicates=N
the subsampling procedure is repeated N times to show variation over the subsampled SNPs. The imap dictionary is used in the .draw()
function to color points, and can be overriden to color points differently from the IMAP used in the tool above.
tool.run(nreplicates=10)
tool.draw(imap=IMAP);
Subsampling SNPs: 25092/100093
# a convenience function for plotting across three axes
tool.draw_panels(0, 1, 2, imap=IMAP);
t-SNE is a manifold learning algorithm that can sometimes better project data into a 2-dimensional plane. The distances between points in this space are harder to interpret.
tool.run_tsne(perplexity=5, seed=333)
tool.draw(imap=IMAP);
Subsampling SNPs: 25092/100093
UMAP is similar to t-SNE but the distances between clusters are more representative of the differences betwen groups. This requires another package that if it is not yet installed it will ask you to install.
tool.run_umap(n_neighbors=13, seed=333)
tool.draw(imap=IMAP);
Subsampling SNPs: 25092/100093
Missing data has large effects on dimensionality reduction methods, and it is best to (1) minimize the amount of missing data in your input data set by using filtering, and (2) impute missing data values. In the examples above data is imputed using the 'sample' method, which probabilistically samples alleles for based on the allele frequency in the group that a taxon is assigned to in IMAP. It is good to compare this to a case where imputation is performed without IMAP assignments, to assess the impact of the a priori assignments. Although this comparison is useful, assigning taxa to groups with IMAP dictionaries for imputation is expected to yield more accurate imputation.
# allow very little missing data
import itertools
tool = ipa.pca(
data=SNPS,
imap={'samples': list(itertools.chain(*[i for i in IMAP.values()]))},
minmaf=0.05,
mincov=0.9,
impute_method="sample",
quiet=True,
)
tool.run(nreplicates=10, seed=123)
tool.draw(imap=IMAP);
# variance explained by each PC axes in the first replicate run
tool.variances[0].round(2)
array([0.16, 0.15, 0.06, 0.05, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01, 0. ])
# PC loadings in the first replicate
tool.pcs(0)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BJSB3 | -4.610 | 74.151 | -18.954 | -26.073 | 0.587 | 0.241 | -1.056 | -2.311 | -1.129 | 0.826 | ... | 1.739 | 1.861 | -0.803 | -2.532 | 0.292 | 45.451 | -3.763 | -6.647 | 0.511 | 5.150e-15 |
BJSL25 | -4.897 | 70.815 | -17.631 | -22.660 | -0.514 | -0.194 | -3.377 | -1.688 | -0.420 | 2.316 | ... | 0.284 | -1.698 | 2.079 | 0.757 | 0.322 | -17.017 | 4.234 | 39.465 | -0.777 | -3.872e-15 |
BJVL19 | -5.211 | 70.611 | -19.109 | -23.022 | 0.183 | -0.428 | -2.558 | -1.428 | -0.932 | 1.365 | ... | -0.139 | -0.828 | -0.356 | 0.900 | -0.154 | -31.057 | 0.534 | -32.138 | 0.011 | 5.343e-15 |
BZBB1 | 59.008 | -7.891 | 3.739 | 1.181 | 9.771 | -21.487 | -5.505 | -10.850 | 4.033 | 2.517 | ... | -23.674 | -7.355 | 44.886 | -0.201 | 0.726 | 0.436 | -12.788 | -0.580 | -1.093 | 1.349e-14 |
CRL0001 | 48.873 | -14.312 | -3.244 | -1.591 | -4.756 | 24.637 | 7.354 | -10.712 | -12.492 | -3.943 | ... | -2.002 | -0.992 | -1.080 | -1.470 | 0.523 | -0.487 | 7.434 | 0.335 | 35.415 | 1.801e-14 |
CRL0030 | 65.379 | -10.158 | 1.379 | 0.016 | 7.581 | -12.391 | -2.594 | -8.762 | 0.429 | 1.645 | ... | -16.870 | -1.965 | -15.548 | -1.647 | 2.670 | 3.929 | 41.397 | -2.348 | -11.342 | 1.248e-14 |
CUCA4 | 28.689 | -17.756 | -5.968 | -4.747 | -25.342 | -13.972 | -15.904 | 60.494 | -27.650 | -2.318 | ... | 3.390 | 0.035 | 1.886 | 1.354 | -1.875 | 0.528 | 0.130 | -0.005 | 0.234 | -5.325e-15 |
CUSV6 | 25.507 | -18.772 | -5.325 | -5.905 | -13.657 | 25.755 | -3.049 | 23.135 | 60.079 | 13.485 | ... | -0.083 | 1.558 | 0.302 | -0.601 | -1.812 | 0.539 | -1.068 | -0.345 | -0.248 | 7.175e-15 |
CUVN10 | 36.071 | -18.435 | -5.100 | -3.779 | -13.754 | 53.098 | 14.693 | -10.917 | -24.551 | -6.453 | ... | 4.404 | 0.415 | 7.293 | -0.120 | -0.499 | -0.757 | -6.956 | -0.045 | -20.352 | 1.455e-14 |
FLAB109 | -27.347 | -23.411 | -22.541 | 6.585 | -17.245 | -6.095 | 3.553 | -13.351 | 0.285 | -0.432 | ... | -9.764 | 2.632 | -4.347 | -0.406 | -20.536 | -0.271 | 1.277 | 1.406 | -0.214 | 5.870e-15 |
FLBA140 | -27.255 | -18.524 | 24.450 | -17.597 | 1.065 | 1.099 | -6.548 | -4.178 | -0.478 | -2.337 | ... | 0.620 | -20.521 | -2.257 | 49.330 | -7.602 | 2.767 | 1.443 | -0.570 | 0.496 | 6.641e-15 |
FLCK18 | -27.442 | -22.547 | -22.679 | 6.941 | -19.645 | -12.994 | 0.340 | -9.506 | -0.153 | 1.499 | ... | 0.083 | 3.186 | 1.081 | -0.757 | -10.247 | -0.010 | -0.548 | -0.900 | -0.054 | -1.586e-15 |
FLCK216 | -25.219 | -19.694 | -21.351 | 8.456 | 20.746 | 3.235 | -6.158 | 2.458 | 6.057 | -11.410 | ... | -1.026 | -4.151 | 0.246 | -0.832 | -1.569 | -0.220 | -0.991 | 0.159 | -0.406 | 8.143e-15 |
FLMO62 | -24.339 | -17.004 | -18.577 | 10.045 | 21.703 | 2.052 | -2.756 | 7.499 | -2.931 | 6.725 | ... | -2.491 | 3.808 | 0.269 | 0.365 | -0.727 | -0.144 | 0.970 | 0.575 | 0.632 | 1.325e-15 |
FLSA185 | -23.002 | -14.798 | -21.469 | 14.032 | 55.138 | 13.869 | -1.326 | 15.814 | -4.724 | 4.430 | ... | 0.218 | 0.192 | -0.342 | -0.226 | 1.249 | -0.003 | 0.303 | 0.127 | -0.266 | 5.993e-16 |
FLSF33 | -29.098 | -20.067 | 28.700 | -18.672 | 1.083 | -1.706 | -6.625 | -2.339 | -4.928 | -1.446 | ... | -9.273 | 40.859 | 1.740 | -7.499 | -2.054 | -1.543 | 0.060 | 0.526 | 0.382 | 1.520e-14 |
FLSF47 | -25.578 | -21.277 | -17.202 | 6.783 | 0.406 | -5.679 | 0.162 | -5.437 | -10.473 | 4.616 | ... | 1.728 | -2.037 | 0.717 | -1.482 | -4.525 | 2.779 | -1.255 | -0.620 | 0.816 | 3.046e-15 |
FLSF54 | -26.729 | -23.262 | -15.358 | 3.753 | -16.081 | -7.160 | 2.704 | -7.128 | 1.987 | -1.853 | ... | 2.786 | 2.569 | 1.066 | 7.329 | 53.344 | 0.025 | 0.218 | 0.002 | -0.061 | 4.989e-15 |
FLWO6 | -27.152 | -23.355 | -17.863 | 5.182 | -25.241 | -11.059 | 3.364 | -13.860 | 8.296 | -4.614 | ... | 4.424 | -1.943 | -0.669 | -6.099 | -8.897 | -0.303 | 0.600 | -0.089 | -0.744 | 5.412e-16 |
HNDA09 | 63.228 | -9.392 | 1.370 | -0.739 | 8.133 | -14.948 | -3.471 | -9.066 | -0.608 | 3.730 | ... | -9.506 | -0.481 | -37.978 | 0.261 | 2.431 | -3.157 | -33.027 | 2.726 | -1.791 | 6.453e-15 |
LALC2 | -27.999 | -20.481 | 26.148 | -17.820 | 1.279 | -2.240 | -1.612 | -0.461 | -1.580 | 0.043 | ... | 8.728 | -39.895 | -2.325 | -32.678 | 2.609 | -0.370 | 0.287 | -0.300 | -0.307 | 6.276e-15 |
MXED8 | -0.865 | 37.463 | 17.114 | 34.215 | -2.978 | -1.335 | 10.300 | 8.005 | 3.227 | -28.786 | ... | -5.141 | -3.268 | -1.844 | -2.208 | -1.518 | -0.215 | -0.886 | 0.676 | -0.322 | 1.025e-14 |
MXGT4 | -6.112 | 44.646 | 17.533 | 29.862 | -4.557 | 2.026 | 6.206 | 6.791 | 8.412 | -22.120 | ... | -6.112 | 0.630 | -2.181 | 2.240 | 1.766 | 1.183 | -1.157 | -0.528 | 0.432 | 1.382e-14 |
MXSA3017 | 53.256 | -7.713 | 3.466 | 2.372 | 12.366 | -18.793 | -1.507 | -10.907 | 6.243 | -4.876 | ... | 52.928 | 10.552 | 6.829 | 2.753 | -4.458 | -0.931 | 2.271 | -0.021 | -1.213 | 5.627e-15 |
SCCU3 | -26.268 | -17.772 | 34.307 | -26.482 | 6.734 | 8.784 | -32.502 | -6.384 | 2.695 | -13.612 | ... | -2.683 | 8.748 | -0.858 | -6.687 | 1.802 | -0.599 | -0.454 | -0.106 | 0.195 | 6.228e-16 |
TXGR3 | -11.573 | 29.699 | 25.049 | 30.186 | -8.234 | 5.636 | -9.980 | -4.686 | -9.879 | 56.818 | ... | 2.577 | -0.045 | 0.119 | -0.108 | 0.866 | 0.746 | 0.604 | -0.698 | 0.366 | -1.291e-15 |
TXMD3 | -13.722 | 28.559 | 25.621 | 30.245 | -3.552 | 3.446 | -5.903 | -3.708 | -0.026 | -5.823 | ... | 6.395 | 2.715 | 1.365 | -0.384 | -1.193 | -0.126 | 1.150 | -0.409 | -0.478 | -3.965e-14 |
TXWV2 | -15.595 | -9.321 | 23.495 | -20.767 | 8.781 | -13.396 | 63.757 | 13.482 | 1.209 | 10.009 | ... | -1.541 | 5.421 | 0.710 | 0.646 | -0.936 | -1.173 | -0.021 | 0.352 | 0.178 | -4.576e-14 |
28 rows × 28 columns
The .draw()
function returns a canvas and axes object from toyplot which can be further modified and styled.
# get plot objects, several styling options to draw
canvas, axes = tool.draw(imap=IMAP, size=8, width=400);
# various axes styling options shown for x axis
axes.x.ticks.show = True
axes.x.spine.style['stroke-width'] = 1.5
axes.x.ticks.labels.style['font-size'] = '13px'
axes.x.label.style['font-size'] = "15px"
axes.x.label.offset = "22px"