Set tests are a powerful approach for association testing between groups of genetic variants and quantitative traits. In this tutorial we demonstrate how to use set tests within the LIMIX framework to test for association (mtSet).
Multi Trait Set test is an implementation of efficient set test algorithms for testing for associations between multiple genetic variants and multiple traits.
mtSet can account for confounding factors such as relatedness and can be used for analysis of single traits.
mtSet can be used both with the command line interface using the limix scripts (mtSet_preprocess
, mtSet_analyze
, mtSet_postprocess
, mtSet_simPheno
) or within python.
In the following, we give a brief example on how to use mtSet. As a case study, we use a subset of the genotypes from the 1000 project [1] and simulated phenotypes.
All commands can be found in _demos/runmtSet.sh
. In the following, we give a short summary of the individual steps. A demo for running mtSet-PC can be found in _demos/runmtSetPC.sh
and it is not showcased here.
Our software depends on Plink version 1.9 (or greater) for preprocessing. Please, make sure you have it before proceeding.
Download and install Limix
git clone --depth 1 https://github.com/PMBio/limix.git
pushd limix
python setup.py install
popd
git clone --depth 1 https://github.com/PMBio/limix-tutorials.git
cd limix-tutorials/mtSet
python download_examples.py
cd data
mkdir out
ls 1000g/
BFILE=1000g/chrom22_subsample20_maf0.10
CFILE=out/chrom22
PFILE=1000g/pheno
WFILE=out/windows
NFILE=out/null
WSIZE=30000
RESDIR=out/results
OUTFILE=out/final
# Kinship matrix estimation
mtSet_preprocess --compute_covariance --bfile $BFILE --cfile $CFILE
# Fitting the null model and assigning the markers to windows
mtSet_preprocess --precompute_windows --fit_null --bfile $BFILE --cfile $CFILE --pfile $PFILE --wfile $WFILE --nfile $NFILE --window_size $WSIZE --plot_windows
mtSet_analyze --bfile $BFILE --cfile $CFILE --pfile $PFILE --nfile $NFILE --wfile $WFILE --minSnps 4 --resdir $RESDIR --start_wnd 0 --end_wnd 100
for i in `seq 0 10`; do
mtSet_analyze --bfile $BFILE --cfile $CFILE --pfile $PFILE --nfile $NFILE --wfile $WFILE --minSnps 4 --resdir $RESDIR --start_wnd 0 --end_wnd 100 --perm $i
done
mtSet_postprocess --resdir $RESDIR --outfile $OUTFILE --manhattan_plot
Before getting started, we have to compute the sample-to-sample genetic covariance matrix, assign the markers to windows and estimate the trait-to-trait covariance matrices on the null model.
The covariance matrix can be pre-computed as follows:
mtSet_preprocess --compute_covariance --plink_path plink_path --bfile bfile --cfile cfile
where
The principal components can be pre-computed as follows:
mtSet_preprocess --compute_PCs k --plink_path plink_path --ffile ffile --bfile bfile
where
To efficiently apply mtSet, it is neccessary to compute the null model beforehand. This can be done with the following command:
mtSet_preprocess --fit_null --bfile bfile --cfile cfile --nfile nfile --pfile pfile --ffile ffile --trait_idx trait_idx
where
If cfile is not set, the relatedness component is omitted from the model.
The estimated parameters are saved in nfile.p0, the negative log likelihood ratio in nfile.nll0, the trait-to-trait genetic covariance matrix in nfile.cg0 and the trait-to-trait residual covariance matrix in nfile.cn0.
--trait_idx 3,4
selects the phenotypes saved in the forth and fifth column (indexing starts with zero).Notice that phenotypes are standardized prior to model fitting.
For applying our set test, the markers have to be assigned to windows. We provide a method that splits the genome in windows of fixed sizes:
mtSet_preprocess --precompute_windows --bfile bfile --wfile wfile --window_size window_size --plot_windows
where
Here, we provided the commands to execute the three preprocessing operations individually. However, it is also possible to combine all steps in a single command:
mtSet_preprocess --compute_covariance --fit_null --precompute_windows ...
Our software package also includes a command-line simulator that allows to generate phenotypes with a wide range of different genetic architectures. In brief, the simulator assumes a linear-additive model, considering the contribution of a randomly selected (causal) genetic region for the set component, polygenic background effects from all remaining genome-wide variants, a contribution from unmeasured factors and iid observation noise. For a detailed description of the simulation procedure, we refer to the Supplementary Methods.
The simulator requires as input the genotypes and the relatedness component:
mtSet_simPheno --bfile bfile --cfile cfile --pfile pfile
where
If none is specified, the covariance matrix is expected to be in the current folder, having the same filename as the bed file.
The file pfile.phe contains the phenotypic values (each sample is saved in one row, each trait in one column). The file pfile.region contains the randomly selected causal region (chromsom, start position, end position). If pfile is not specified, the files are saved in the current folder having an automatic generated filename containing the bed filename and the values of all simulation parameters.
By changing the following parameters different genetic architectures can be simulated and, in particular, the simulation experiments of our paper can be reproduced.
Option | Default | Datatype | Explanation |
---|---|---|---|
--seed |
0 | int | seed for random number generator |
--nTraits |
4 | int | number of simulated phenotypes |
--windowSize |
1.5e4 | int | size of causal region |
--vTotR |
0.05 | float | variance explained by the causal region |
--nCausalR |
10 | int | number of causal variants in the region |
--pCommonR |
0.8 | float | percentage of shared causal variants |
--vTotBg |
0.4 | float | variance explained by the polygenic background effects |
--pHidden |
0.6 | float | residual variance explained by hidden confounders (in %) |
--pCommon |
0.8 | float | background and residual signal that is shared across traits (in %) |
--chrom |
None | int | specifies the chromosome of the causal region |
--minPos |
None | int | specifies the min. chromosomal position of the causal region (in basepairs) |
--maxPos |
None | int | specifies the max. chromosomal position of the causal region (in basepairs) |
Once the preprocessing step has been used to obtain the genetic relatedness matrix, to fit the null model and to identify the genetic regions to be considered in the analysis, the set test can be run by the following analysis script:
mtSet_analyze --bfile bfile --cfile cfile --pfile pfile --nfile nfile --wfile wfile --ffile ffile --minSnps minSnps --start_wnd start_wnd --end_wnd end_wnd --resdir rdir --trait_idx traitIdx
where
rdir is the directory to which the results are exported. The results are exported in the folder rdir/perm if a permutation seed has been set, otherwise in the folder rdir/test. The output file is named - start_wnd_endwnd.res and contains results in the following format: window index, chromosome, start position, stop position, index of start position, number of SNPs and log likelihood ratio.
Notice that phenotypes are standardized prior to model fitting.
After running mtSet, the following script can be used to merge the result files and estimate the p-values (p-values are obtained by a parametric fit of the test statistics):
mtSet_postprocess --resdir resdir --outfile outfile --manhattan_plot
where
outfile.perm lists the test statistics (first column) and p-values (second column) of the permutated windows outfile.test contains the (index, chromosome, start position, stop position, SNP index, number of SNPs, test statistics and p-value) of each window. Each window is saved in one row.
This part of the tutorial shows how to use mtSet within python. For a tutorial on how to use mtSet from the command line using the limix scripts (mtSet_preprocess, mtSet_analyze, mtSet_postprocess, mtSet_simPheno) please refer to sections Processing, Phenotype simulation, Running analysis, and Postprocessing.
# activiate inline plotting
%matplotlib inline
from download_examples import get_1000G_mtSet
import scipy as sp
import scipy.linalg
import limix
# loading 1000G genotypes for mtSet demo
get_1000G_mtSet()
# base name for bed, bim and fam
bfile = './data/1000g/chrom22_subsample20_maf0.10'
from limix.mtSet.core import plink_reader
# import genotype positions
bim = plink_reader.readBIM(bfile,usecols=(0,1,2,3))
chrom = bim[:, 0].astype(float)
pos = bim[:, -1].astype(float)
# uses splitter to split the genotypes
from limix.mtSet.core.splitter import Splitter
split = Splitter(pos=pos,chrom=chrom)
The method splitGeno
allows to define the regions that will then considered for the analysis with mtSet.
Information relative to the calculated regions can be cached in an external file by activating the cache option (see below).
Argument | Default | Datatype | Explanation |
---|---|---|---|
method | 'slidingWindow' | str | Uses a sliding window approach to define regions (a region-based approach will be availabe soon) |
size | 5E+04 (50kb) | float | Window size. Pace is set at half the size of the window |
minSnps | 1 | int | Windows with number of SNPs lower that this threshold are not considered |
maxSnps | sp.inf | int | Windows with number of SNPs higher that this threshold are not considered |
cache | False | bool | If true, it activates the caching |
out_dir | './cache' | str | outdir of the cache file |
fname | None | str | Name of the file |
rewrite | False | bool | If true and the cache file already exists, the cache file is overwritten |
split.splitGeno(cache=True, fname='regions.h5', minSnps=4)
print '%d windows' % split.nWindows
1380 windows
In this section we showcase how to construct the mtSet class that will then be used for the set test analysis. We showcase both the full mtSet that models relatedness as random effect by the means of an individual-to-individual covariance matrix and the approximated model (mtSetPC) that models relatedness as fixed effect using principal component from the covariance.
# import phenotype and sample relatedness
pheno_file = './data/1000g/pheno.phe'
sample_relatedness_file = './data/1000g/chrom22.cov'
Y = sp.loadtxt(pheno_file)
R = sp.loadtxt(sample_relatedness_file)
# compute eigenvalues and eigenvectors of the sample relatedness matrix
S_R, U_R = scipy.linalg.eigh(R) # these are needed for the full mtSet model
# caculate fixed effects with rel
F = U_R[:, ::-1][:, :10] # it considered the first 10 PCs
F = sp.concatenate([F, sp.ones((F.shape[0], 1))], 1) # add an intercept term
if 0:
# use full mtSet implementation
# (relatedness is modeled as random effect by the means of
# an individual-to-individual covariance matrix)
mtSet = limix.MTSet(Y, S_R=S_R, U_R=U_R)
else:
# use mtSetPC
# (relatedness s modelled as fixed effect
# using principal component from the covariance)
mtSet = limix.MTSet(Y, F=F)
If the analysis is parallelized across different set of regions and permutations, it might be convenient to cache the results from the optimization of the null model (as the null model need to be optimized only once).
Argument | Default | Datatype | Explanation |
---|---|---|---|
cache | False | bool | If true, it activates the caching |
out_dir | './cache' | str | outdir of the cache file |
fname | None | str | Name of the file |
rewrite | False | bool | If true and the cache file already exists, the cache file is overwritten |
RV = mtSet.fitNull()
The returned dictionary contains:
# read fam
bim = plink_reader.readBIM(bfile,usecols=(0,1,2,3))
fam = plink_reader.readFAM(bfile,usecols=(0,1))
n_wnds = 100 # only hundred windows are considered
LLR = sp.zeros(n_wnds) # vector with test statistics of the n_wnds regions
for wnd_i in range(n_wnds):
wnd_pos = split.wnd_pos[wnd_i]
nSnps = split.nSnps[wnd_i]
idx_wnd_start = split.idx_wnd_start[wnd_i]
print('.. window %d - (%d, %d-%d) - %d snps' % (wnd_i, wnd_pos[0], wnd_pos[1], wnd_pos[2], nSnps))
Xr = plink_reader.readBED(bfile, useMAFencoding=True, start = idx_wnd_start, nSNPs = nSnps, bim=bim , fam=fam)['snps']
# multi trait set test fit
RV = mtSet.optimize(Xr)
LLR[wnd_i] = RV['LLR'][0]
.. window 0 - (22, 16025000-16075000) - 21 snps .. window 1 - (22, 16050000-16100000) - 23 snps .. window 2 - (22, 16125000-16175000) - 7 snps .. window 3 - (22, 16225000-16275000) - 9 snps .. window 4 - (22, 16250000-16300000) - 16 snps .. window 5 - (22, 16275000-16325000) - 12 snps .. window 6 - (22, 16325000-16375000) - 5 snps .. window 7 - (22, 16350000-16400000) - 5 snps .. window 8 - (22, 16475000-16525000) - 8 snps .. window 9 - (22, 16500000-16550000) - 7 snps .. window 10 - (22, 16550000-16600000) - 5 snps .. window 11 - (22, 16600000-16650000) - 13 snps .. window 12 - (22, 16625000-16675000) - 18 snps .. window 13 - (22, 16650000-16700000) - 10 snps .. window 14 - (22, 16825000-16875000) - 46 snps .. window 15 - (22, 16850000-16900000) - 75 snps .. window 16 - (22, 16875000-16925000) - 68 snps .. window 17 - (22, 16900000-16950000) - 62 snps .. window 18 - (22, 16925000-16975000) - 43 snps .. window 19 - (22, 16950000-17000000) - 22 snps .. window 20 - (22, 16975000-17025000) - 34 snps .. window 21 - (22, 17000000-17050000) - 73 snps .. window 22 - (22, 17025000-17075000) - 80 snps .. window 23 - (22, 17050000-17100000) - 71 snps .. window 24 - (22, 17075000-17125000) - 61 snps .. window 25 - (22, 17100000-17150000) - 64 snps .. window 26 - (22, 17125000-17175000) - 42 snps .. window 27 - (22, 17150000-17200000) - 9 snps .. window 28 - (22, 17175000-17225000) - 24 snps .. window 29 - (22, 17200000-17250000) - 39 snps .. window 30 - (22, 17225000-17275000) - 26 snps .. window 31 - (22, 17250000-17300000) - 55 snps .. window 32 - (22, 17275000-17325000) - 109 snps .. window 33 - (22, 17300000-17350000) - 115 snps .. window 34 - (22, 17325000-17375000) - 92 snps .. window 35 - (22, 17350000-17400000) - 77 snps .. window 36 - (22, 17375000-17425000) - 92 snps .. window 37 - (22, 17400000-17450000) - 97 snps .. window 38 - (22, 17425000-17475000) - 68 snps .. window 39 - (22, 17450000-17500000) - 54 snps .. window 40 - (22, 17475000-17525000) - 61 snps .. window 41 - (22, 17500000-17550000) - 83 snps .. window 42 - (22, 17525000-17575000) - 81 snps .. window 43 - (22, 17550000-17600000) - 72 snps .. window 44 - (22, 17575000-17625000) - 88 snps .. window 45 - (22, 17600000-17650000) - 99 snps .. window 46 - (22, 17625000-17675000) - 158 snps .. window 47 - (22, 17650000-17700000) - 156 snps .. window 48 - (22, 17675000-17725000) - 96 snps .. window 49 - (22, 17700000-17750000) - 89 snps .. window 50 - (22, 17725000-17775000) - 65 snps .. window 51 - (22, 17750000-17800000) - 77 snps .. window 52 - (22, 17775000-17825000) - 111 snps .. window 53 - (22, 17800000-17850000) - 102 snps .. window 54 - (22, 17825000-17875000) - 73 snps .. window 55 - (22, 17850000-17900000) - 69 snps .. window 56 - (22, 17875000-17925000) - 129 snps .. window 57 - (22, 17900000-17950000) - 156 snps .. window 58 - (22, 17925000-17975000) - 120 snps .. window 59 - (22, 17950000-18000000) - 91 snps .. window 60 - (22, 17975000-18025000) - 93 snps .. window 61 - (22, 18000000-18050000) - 88 snps .. window 62 - (22, 18025000-18075000) - 68 snps .. window 63 - (22, 18050000-18100000) - 112 snps .. window 64 - (22, 18075000-18125000) - 113 snps .. window 65 - (22, 18100000-18150000) - 73 snps .. window 66 - (22, 18125000-18175000) - 71 snps .. window 67 - (22, 18150000-18200000) - 58 snps .. window 68 - (22, 18175000-18225000) - 74 snps .. window 69 - (22, 18200000-18250000) - 75 snps .. window 70 - (22, 18225000-18275000) - 56 snps .. window 71 - (22, 18250000-18300000) - 77 snps .. window 72 - (22, 18275000-18325000) - 103 snps .. window 73 - (22, 18300000-18350000) - 88 snps .. window 74 - (22, 18325000-18375000) - 67 snps .. window 75 - (22, 18350000-18400000) - 63 snps .. window 76 - (22, 18375000-18425000) - 65 snps .. window 77 - (22, 18400000-18450000) - 116 snps .. window 78 - (22, 18425000-18475000) - 153 snps .. window 79 - (22, 18450000-18500000) - 124 snps .. window 80 - (22, 18475000-18525000) - 125 snps .. window 81 - (22, 18500000-18550000) - 111 snps .. window 82 - (22, 18525000-18575000) - 65 snps .. window 83 - (22, 18550000-18600000) - 50 snps .. window 84 - (22, 18575000-18625000) - 45 snps .. window 85 - (22, 18600000-18650000) - 80 snps .. window 86 - (22, 18625000-18675000) - 74 snps .. window 87 - (22, 18650000-18700000) - 17 snps .. window 88 - (22, 18675000-18725000) - 11 snps .. window 89 - (22, 18700000-18750000) - 19 snps .. window 90 - (22, 18725000-18775000) - 14 snps .. window 91 - (22, 18750000-18800000) - 11 snps .. window 92 - (22, 18775000-18825000) - 8 snps .. window 93 - (22, 18800000-18850000) - 11 snps .. window 94 - (22, 18825000-18875000) - 14 snps .. window 95 - (22, 18850000-18900000) - 42 snps .. window 96 - (22, 18875000-18925000) - 102 snps .. window 97 - (22, 18900000-18950000) - 101 snps .. window 98 - (22, 18925000-18975000) - 93 snps .. window 99 - (22, 18950000-19000000) - 78 snps
The returned dictionary from mtSet.optimize contains:
P values are obtained from a relatively small number of genome-wide permutations, fitting a parametric model to the null distribution. Here we showcase the permutation procedure by considering 10 permutations for the 10 regions analyzed.
n_perms = 10
LLR_null = [] # in this list test statistics from permutations will be stored
for perm_i in range(n_perms):
#1. generate permutation
print 'permutation %d' % perm_i
sp.random.seed(perm_i)
perm_idxs = sp.random.permutation(Y.shape[0])
#2. scan on the 100 regions
for wnd_i in range(n_wnds):
wnd_pos = split.wnd_pos[wnd_i]
nSnps = split.nSnps[wnd_i]
idx_wnd_start = split.idx_wnd_start[wnd_i]
Xr = plink_reader.readBED(bfile, useMAFencoding=True, start = idx_wnd_start, nSNPs = nSnps, bim=bim , fam=fam)['snps']
Xr = Xr[perm_idxs, :] # permute samples in region term
RV = mtSet.optimize(Xr)
LLR_null.append(RV['LLR'][0])
LLR_null = sp.array(LLR_null)
permutation 0 permutation 1 permutation 2 permutation 3 permutation 4 permutation 5 permutation 6 permutation 7 permutation 8 permutation 9
The parametric fit to the ditribution of the test statistics under the null and the consequent conversion of the observed test statistics in P values is performed by the module limix.stats.chi2mixture
as shown below.
from limix.stats.chi2mixture import Chi2mixture
c2m = Chi2mixture(tol=4e-3)
c2m.estimate_chi2mixture(LLR_null)
pv = c2m.sf(LLR)
#makes a manhattan plot
wnd_chrom = split.wnd_pos[:n_wnds,0]
wnd_start = split.wnd_pos[:n_wnds,1]
wnd_end = split.wnd_pos[:n_wnds,2]
import pylab as pl
pl.plot(wnd_start, -sp.log10(pv), 'o', color='MidnightBlue')
[<matplotlib.lines.Line2D at 0x115fe4fd0>]
[1] Genomes Project, C. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65 (2012).
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')