This GATK tutorial corresponds to a section of the GATK Workshop 2b. Germline Hard Filtering Tutorial worksheet. The goal is to become familiar with germline variant annotations. The notebook illustrates the following steps.
A kernel is a computational engine that executes the code in the notebook. We use Python 3 in this notebook to execute GATK commands using Python Magic (!
). Later we will switch to another notebook to do some plotting in R.
Click to select a gray cell and then pressing SHIFT+ENTER to run the cell.
Write results to /home/jupyter-user/
. To access the directory, click on the upper-left jupyter icon.
# Check if data is accessible. The command should list several gs:// URLs.
! gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/
! gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed
! gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/
# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage.
# Afterwards, restart the kernel with Kernel > Restart.
#! pip install google-cloud-storage
Subset the trio callset to just the SNPs of the mother (sample NA12878). Make sure to remove sites for which the sample genotype is homozygous-reference and remove unused alleles, including spanning deletions.
The tool recalculates depth of coverage (DP) per site as well as the allele count in genotypes for each ALT allele (AC), allele frequency for each ALT allele (AF), and total number of alleles in called genotypes (AN), to reflect only the subset sample(s).
! gatk SelectVariants \
-V gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/trio.vcf.gz \
-sn NA12878 \
-select-type SNP \
--exclude-non-variants \
--remove-unused-alternates \
-O /home/jupyter-user/motherSNP.vcf.gz
Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants -V gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/trio.vcf.gz -sn NA12878 -select-type SNP --exclude-non-variants --remove-unused-alternates -O /home/jupyter-user/motherSNP.vcf.gz 05:34:16.167 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so 05:34:16.395 INFO SelectVariants - ------------------------------------------------------------ 05:34:16.396 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.0.0 05:34:16.396 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/ 05:34:16.396 INFO SelectVariants - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64 05:34:16.396 INFO SelectVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09 05:34:16.397 INFO SelectVariants - Start Date/Time: March 21, 2019 5:34:16 AM UTC 05:34:16.397 INFO SelectVariants - ------------------------------------------------------------ 05:34:16.397 INFO SelectVariants - ------------------------------------------------------------ 05:34:16.398 INFO SelectVariants - HTSJDK Version: 2.18.2 05:34:16.398 INFO SelectVariants - Picard Version: 2.18.25 05:34:16.399 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2 05:34:16.399 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 05:34:16.399 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 05:34:16.399 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 05:34:16.399 INFO SelectVariants - Deflater: IntelDeflater 05:34:16.400 INFO SelectVariants - Inflater: IntelInflater 05:34:16.400 INFO SelectVariants - GCS max retries/reopens: 20 05:34:16.400 INFO SelectVariants - Requester pays: disabled 05:34:16.400 INFO SelectVariants - Initializing engine 05:34:21.549 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/trio.vcf.gz 05:34:24.616 INFO SelectVariants - Done initializing engine 05:34:24.683 INFO SelectVariants - Including sample 'NA12878' 05:34:24.709 INFO ProgressMeter - Starting traversal 05:34:24.710 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute 05:34:34.337 INFO ProgressMeter - 20:62910954 0.2 122091 760928.6 05:34:34.338 INFO ProgressMeter - Traversal complete. Processed 122091 total variants in 0.2 minutes. 05:34:34.354 INFO SelectVariants - Shutting down engine [March 21, 2019 5:34:34 AM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.30 minutes. Runtime.totalMemory()=935854080
# Peruse the resulting file
! zcat /home/jupyter-user/motherSNP.vcf.gz | grep -v '##' | head
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 61098 . C T 465.13 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.516;ClippingRankSum=0.00;DP=44;ExcessHet=3.0103;FS=0.000;MQ=59.48;MQRankSum=0.803;QD=10.57;ReadPosRankSum=1.54;SOR=0.603 GT:AD:DP:GQ:PL 0/1:28,16:44:99:496,0,938 20 61795 . G T 2034.16 . AC=1;AF=0.500;AN=2;BaseQRankSum=-6.330e-01;ClippingRankSum=0.00;DP=60;ExcessHet=3.9794;FS=0.000;MQ=59.81;MQRankSum=0.00;QD=17.09;ReadPosRankSum=1.23;SOR=0.723 GT:AD:DP:GQ:PL 0/1:30,30:60:99:1003,0,1027 20 63244 . A C 923.13 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.637;ClippingRankSum=0.00;DP=57;ExcessHet=3.0103;FS=5.470;MQ=59.60;MQRankSum=-1.019e+00;QD=16.20;ReadPosRankSum=0.404;SOR=1.528 GT:AD:DP:GQ:PL 0/1:30,27:57:99:954,0,1064 20 63799 . C T 1766.16 . AC=1;AF=0.500;AN=2;BaseQRankSum=-6.530e-01;ClippingRankSum=0.00;DP=45;ExcessHet=3.9794;FS=0.000;MQ=59.78;MQRankSum=1.22;QD=16.98;ReadPosRankSum=-1.075e+00;SOR=0.709 GT:AD:DP:GQ:PL 0/1:19,26:45:99:953,0,670 20 65900 . G A 5817.13 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.503;ClippingRankSum=0.00;DP=64;ExcessHet=3.0103;FS=4.289;MQ=59.65;MQRankSum=0.00;QD=31.61;ReadPosRankSum=0.732;SOR=1.032 GT:AD:DP:GQ:PL 0/1:41,23:64:99:809,0,1596 20 66370 . G A 5611.13 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.26;ClippingRankSum=0.00;DP=52;ExcessHet=3.0103;FS=6.196;MQ=60.00;MQRankSum=0.00;QD=33.01;ReadPosRankSum=-9.100e-02;SOR=0.647 GT:AD:DP:GQ:PL 0/1:31,21:52:99:716,0,1103 20 66720 . C A 2204.16 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.663;ClippingRankSum=0.00;DP=59;ExcessHet=3.9794;FS=12.193;MQ=60.00;MQRankSum=0.00;QD=15.86;ReadPosRankSum=-1.250e-01;SOR=1.219 GT:AD:DP:GQ:PL 0/1:31,28:59:99:948,0,1123 20 68749 . T C 4285.16 . AC=2;AF=1.00;AN=2;BaseQRankSum=0.613;ClippingRankSum=0.00;DP=52;ExcessHet=3.9794;FS=2.137;MQ=59.86;MQRankSum=0.00;QD=26.29;ReadPosRankSum=1.24;SOR=0.924 GT:AD:DP:GQ:PL 1/1:0,52:52:99:2214,157,0 20 70980 . G A 672.13 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.800e-01;ClippingRankSum=0.00;DP=53;ExcessHet=3.0103;FS=1.042;MQ=59.65;MQRankSum=-1.196e+00;QD=12.68;ReadPosRankSum=0.444;SOR=0.848 GT:AD:DP:GQ:PL 0/1:32,21:53:99:703,0,1198 grep: write error: Broken pipe gzip: stdout: Broken pipe
We use VariantAnnotator to annotate which variants in our callset are also present in the truthset (GIAB), which are considered true positives. Variants not present in the truthset are considered false positives. Here we produce a callset where variants that are present in the truthset are annotated with the giab.callsets annotation plus a value indicating how many of the callsets used to develop the truthset agreed with that call.
! gatk VariantAnnotator \
-V /home/jupyter-user/motherSNP.vcf.gz \
--resource:giab gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/motherGIABsnps.vcf.gz \
-E giab.callsets \
-O /home/jupyter-user/motherSNP.giab.vcf.gz
# Peruse the resulting file
! zcat /home/jupyter-user/motherSNP.giab.vcf.gz | grep -v '##' | head
Convert the information from the callset into a tab delimited table using VariantsToTable, so that we can parse it easily in R. The tool parameters differentiate INFO/site-level fields fields (-F
) and FORMAT/sample-level fields genotype fields (-GF
). This step produces a table where each line represents a variant record from the VCF, and each column represents an annotation we have specified. Wherever the requested annotations are not present, e.g. RankSum annotations at homozygous sites, the value will be replaced by NA.
! gatk VariantsToTable \
-V /home/jupyter-user/motherSNP.giab.vcf.gz \
-F CHROM -F POS -F QUAL \
-F BaseQRankSum -F MQRankSum -F ReadPosRankSum \
-F DP -F FS -F MQ -F QD -F SOR \
-F giab.callsets \
-GF GQ \
-O /home/jupyter-user/motherSNP.giab.txt
# Peruse the resulting file
! cat /home/jupyter-user/motherSNP.giab.txt | head -n300
# Focus in on a few columns
! cat /home/jupyter-user/motherSNP.giab.txt | cut -f1,2,7,12 | head -n300
Load the R notebook now to run the plots for this next section. Continue below only after you've finished with the other notebook.
Based on the plots we generated, we're going to apply some filters to weed out false positives. To illustrate how VariantFiltration works, and to establish baseline performance, we first filter on QUAL < 30. By default, GATK GenotypeGVCFs filters out variants with QUAL < 10. This step produces a VCF with all the original variants; those that failed the filter are annotated with the filter name in the FILTER column.
# Filter callset on one annotation, QUAL < 30
! gatk VariantFiltration \
-R gs://gatk-tutorials/workshop_1702/variant_discovery/data/ref/ref.fasta \
-V /home/jupyter-user/motherSNP.vcf.gz \
--filter-expression "QUAL < 30" \
--filter-name "qual30" \
-O /home/jupyter-user/motherSNPqual30.vcf.gz
# Peruse the results; try adding 'grep "qual30"'
! zcat /home/jupyter-user/motherSNPqual30.vcf.gz | grep -v '##' | head -n10
# Calculate concordance metrics using GATK4 BETA tool Concordance
! gatk Concordance \
-eval /home/jupyter-user/motherSNPqual30.vcf.gz \
-truth gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/motherGIABsnps.vcf.gz \
-L gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed \
-S /home/jupyter-user/motherSNPqual30.txt
# View the results
! echo ""
! cat /home/jupyter-user/motherSNPqual30.txt
To filter on multiple expressions, provide each in separate expression. For INFO level annotations, the parameter is -filter
, which should be immediately followed by the corresponding –-filter-name
label. Here we show basic hard-filtering thresholds.
--missing-values-evaluate-as-failing
flag.--genotype-filter-expression
and --genotype-filter-name
.# Filter callset on multiple annotations.
# Iterate on thresholds to improve precision while maintaining high sensitivity.
! gatk VariantFiltration \
-V /home/jupyter-user/motherSNP.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O /home/jupyter-user/motherSNPfilters.vcf.gz
# Sanity-check that filtering is as expected by examining filtered records and PASS records.
! zcat /home/jupyter-user/motherSNPfilters.vcf.gz | grep -v '##' | grep -v 'PASS' | head -n20 | cut -f6-10
! zcat /home/jupyter-user/motherSNPfilters.vcf.gz | grep -v '#' | grep 'PASS' | head | cut -f6-10
# Calculate concordance metrics using GATK4 BETA tool Concordance
! gatk Concordance \
-eval /home/jupyter-user/motherSNPfilters.vcf.gz \
-truth gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/motherGIABsnps.vcf.gz \
-L gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed \
-S /home/jupyter-user/motherSNPfilters.txt
#Now lets re-run concordance from just using QUAL filtering first
!cat /home/jupyter-user/motherSNPqual30.txt
# View the results from filtering on multiple annotations
! echo ""
! cat /home/jupyter-user/motherSNPfilters.txt
We performed hard-filtering to learn about germline variant annotations. Remember that GATK recommends Variant Quality Score Recalibration (VQSR) for germline variant callset filtering. For more complex variant filtering and annotation, see the Broad Hail.is framework.