Bioinformatics/Tools example

뭉치 2013. 7. 15. 07:42


In this exercise we will use the processed bam-file and call SNPs from it. We will use two different SNP callers, Samtools/bcftools and GATK. We will look at the output (VCF-format) and use vcftools to compare the SNP calls and integrate with dbSNP information.

  1. SNP calling using Unified Genotyper (GATK)
  2. Be acquainted with VCF-files
  3. Soft filtering (Variant recalibration)
  4. Hard filtering
  5. Heterozygotes vs Homozygotes
  6. Comparison between UnifiedGenotyper and HaplotypeCaller (Optional)
  7. Analysis of SNP consequences (Optional)
  8. Samtools and bcftools (Optional)


Create a directory and copy the files needed for the exercise there. You need the HG00418 bamfile you created during alignment preprocessing, if you do not have it you can get copy it from here /home/local/27626/exercises/snp_calling/HG00418.sort.recal.baq.rmdup.bam .

mkdir snp_calling
cd snp_calling
cp /home/local/27626/exercises/snp_calling/chr21.* .
ln -s /home/local/27626/bin/GenomeAnalysisTK-2.5-2/GenomeAnalysisTK.jar .

# copy your HG00418.sort.rmdup.realn.recal.bam to here #
# if you dont have one you can copy it from here: #
cp /home/local/27626/exercises/snp_calling/HG00418.sort.rmdup.realn.recal.* .

The HaplotypeCaller and UnifiedGenotyper (GATK)

The current state-of-the-art genotyper is the HaplotypeCaller by the GATK. The HaplotypeCaller will perform a local de novo assembly around each of the potential variants in the alignment file and then output both SNPs and indels with very high accuracy. However it is in active development and still quite slow. For the sake of time we will then call SNPs using the UnifiedGenotyper which is almost as accurate but faster and also made by the GATK. In the command section below you will also see code for the HaplotypeCaller which you should use if you have more time than we do now for the exercise. When I tested the exercise the UnifiedGenotyper took ~5 minutes (using 4 cores) and the HaplotypeCaller took 45 minutes.

Lets start the SNP calling using the UnifiedGenotyper by running the GATK using "-T UnifiedGenotyper". We add a extra options: "--dbsnp" this will annotate our SNPs with know human variation ids, the output file "-o" which is output in VCF format, "-stand_call_conf" which is minimum Phred quality value of SNP to pass filtering, and "-stand_emit_conf" which is the minimum Phred quality value of the SNP to be reported at all. If we also want to call indels then you need to add the option "-glm BOTH", but for the exercises let us just call snps

java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -Xmx2g -jar GenomeAnalysisTK.jar \
   -R chr21.fasta \
   -T UnifiedGenotyper \
   -I HG00418.sort.rmdup.realn.recal.bam \
   --dbsnp /home/local/27626/exercises/snp_calling/dbSNP135.snps.sites.vcf.gz \
   -o HG00418.snps.raw.vcf \
   -nct 4 \
   -stand_call_conf 30.0 \
   -stand_emit_conf 10.0

java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -Xmx2g -jar GenomeAnalysisTK.jar \
   -R chr21.fasta \
   -T HaplotypeCaller \
   -I HG00418.sort.rmdup.realn.recal.bam \
   --dbsnp /home/local/27626/exercises/snp_calling/dbSNP135.snps.sites.vcf.gz \
   -o HG00418.haplotyper.raw.vcf \
   -stand_call_conf 30.0 \
   -stand_emit_conf 10.0

Open the output with the raw snp calls (given by -o). You will see that the file consists of two main sections, header denoted by "#" as the first character and the actual SNP calls. The header gives you information about all the abbreviations in the file and is quite handy. The columns in the variant part is: Chromosome, Position, dbSNP id, Reference base, Alternative base, Phred quality of call, Filter status, Lots of info, Format of the next field and then genotype information for our sample (HG00148). Lets try to see how many SNP calls we got and how many that pass the 30 quality filter:

less -S HG00418.snps.raw.vcf
grep -v "#" -c HG00418.snps.raw.vcf
grep -v "#" HG00418.snps.raw.vcf | perl -ane 'if ($F[5] > 30) { print $_}' | wc -l

Q1. How many SNP calls did we get and how many passed (>30 Qual)?

Variant Quality recalibration / Soft filtering

The raw SNP calls are likely to contain false positive calls made from biases in the sequencing data. We will try to recalibrate the variant qualities of our SNPs. First we lets create shortcuts to the datasets we are going to use for calibration.

ln -s /home/local/27626/exercises/snp_calling/hapmap_3.3.b37.sites.vcf.gz .
ln -s /home/local/27626/exercises/snp_calling/hapmap_3.3.b37.sites.vcf.gz.tbi .
ln -s /home/local/27626/exercises/snp_calling/1000G_omni2.5.b37.sites.vcf.gz .
ln -s /home/local/27626/exercises/snp_calling/1000G_omni2.5.b37.sites.vcf.gz.tbi .
ln -s /home/local/27626/exercises/snp_calling/dbsnp_135.b37.vcf.gz .
ln -s /home/local/27626/exercises/snp_calling/dbsnp_135.b37.vcf.gz.tbi .

Now we are ready for recalibration. Because we have human data we are going to use information from Hapmap and the Omni-chip to recalibrate our SNPs. Additionally because we only have a small data-set (only chr21) we are going to add the command "--maxGaussians 4" - if you have full genome data this is not needed. Similar we leave out dbSNP information, but it can be added ("-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_135.b37.vcf \ ") - the dbSNP information is only used for visualizing the results. We are going to output some plots in the var_recal directory.

mkdir var_recal
java -XX:+UseParallelGC -XX:ParallelGCThreads=8  -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R chr21.fasta \
   -input HG00418.snps.raw.vcf \
   -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf.gz \
   -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf.gz \
   -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ \
   -recalFile var_recal/HG00418.recal \
   -tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90 \
   -tranchesFile var_recal/HG00418.tranches \
   -rscriptFile var_recal/HG00418.plots.R \
   -mode SNP \
   --maxGaussians 4  

If you cannot run the tranche command due to an error copy my files

cp /home/local/ngs_course/stud100/snp_calling/var_recal/HG00418.* var_recal/

Open the tranche-plot. The plot shows that the number of variants (x-axis, here novel=all because we did not use dbSNP information) and whether they are predicted to be True Positives (TP) or False Positives (FP). The "truth" number defines the trances, so that "90" means that we accept all variants until we reach 90% of our truth set, "99" means that we accept all variants until we have reached 99% of our truth set and so on. Blue are predicted to be true SNPs and orange to be false SNPs. We use this plot to decide where to put the threshold for filtering, eg. how many more false SNPs are you likely to get by increasing from 99.0 to 99.9. Unfortunately sometimes the tranches are ordered wrongly so it becomes a little hard to see.

acroread var_recal/HG00418.tranches.pdf &

Q2. At which tranche would you put the threshold?

Note that if we are not using human data it is possible to do the recalibration step, but then we can use the most significant SNPs in our data to recalibrate our SNP qualities .

Apply calibration to SNPs

Lets apply the recalibration to our SNP calls. Lets try with a threshold of 99.0

java -XX:+UseParallelGC -XX:ParallelGCThreads=8  -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R chr21.fasta \
   -input HG00418.snps.raw.vcf \
   --ts_filter_level 99.0 \
   -tranchesFile var_recal/HG00418.tranches \
   -recalFile var_recal/HG00418.recal \
   -o HG00418.snps.recal.filtered.vcf

Open the recalibrated file and check the "Filter" column, you should now see that all SNPs that were classified at 99.0 or better are all termed "PASS" and the rest are termed "LowQual", "TruthSensitivityTranche99.00to99.90" or "TruthSensitivityTranche99.90to100.00". We can now get all the trusted SNPs by taking the ones with "PASS".

less -S HG00418.snps.recal.filtered.vcf
grep "#" HG00418.snps.recal.filtered.vcf > header.vcf
grep "PASS" HG00418.snps.recal.filtered.vcf | cat header.vcf - > HG00418.snps.recal.pass.vcf
grep -c "PASS" HG00418.snps.recal.filtered.vcf

Q3. How many SNPs do we have that PASS the filtering?

However keep in mind that the even at the 99.0% sensitivity tranche we expect quite a number of false positive calls (the orange and orange scribbled bars in the plot). Lets try to divide the calls into known and novel (based on dbSNP) and plot the quality distributions using R.

perl -ane 'if ($_ =~ m/^#/) {} else {if ($F[2] eq ".") {print $F[5], "\n"}}' HG00418.snps.recal.pass.vcf > novel.qual.txt
perl -ane 'if ($_ =~ m/^#/) {} else {if ($F[2] ne ".") {print $F[5], "\n"}}' HG00418.snps.recal.pass.vcf > known.qual.txt

n = read.table("novel.qual.txt")
k = read.table("known.qual.txt")
plot(density(k[,1]), xlim=c(0,500), ylim=c(0,0.01), xlab="Variant Quality", main="Known vs Novel SNPs")
points(density(n[,1]), type="l", col="red")
legend("topright", legend=c("known", "novel"), col=c("black", "red"), lty=1)
dev.print("variants.known_vs_novel.pdf", device=pdf)

Q4. Based on this, would you trust all of the novel SNPs?
Q5. Why do you think that it looks like this? Hint: dbSNP contains information from different populations, including Asians (HG00418 is Han Chinese).

We could remove all novel SNPs with lower quality than eg 70 (this would be to apply hard-filtering).

perl -ane 'if ($_ =~ m/^#/) {print $_} else {if ($F[2] eq "." and $F[5] < 70) {} else {print $_}}' \
HG00418.snps.recal.pass.vcf > HG00418.snps.recal.pass.novel_filt.vcf
grep -v "#" -c HG00418.snps.recal.pass.novel_filt.vcf

Q6. How many SNPs do we then have and how many novel did we remove (hint: Q3-current number)

Hard Filtering

Hard filtering is setting specific thresholds for SNP properties and then removing all SNPs that does not fullfil these criteria. Eg, we could set a threshold of minimum depth of 5, minimum quality of 70 and disallow any variant calls (both SNPs and indels) within 5 bases of each other. The minimum depth is of course dependent on the total depth of your genome - if you have a high coverage genome, I would probably set it to minimum 10. The filtering can be done using vcf-annotate from the Vcftools package. Additional filtering options can also be found here. Hard filtering is sometimes the only solution if we dont have an extensive variation database for our particular species. Lets try this on the raw SNP calls:

vcf-annotate --filter d=5/Q=70/c=2,5 HG00418.snps.raw.vcf > HG00418.snps.hard_filt.vcf

This annotates the Filter column with extra values if a SNP fails a criteria and puts "PASS" for the SNPs that pass the criteria. Lets see how many we get and compare the results with the soft calibration we did before. To be fair lets also set a minimum depth of 5 for the soft calibrated SNP calls.

grep "#" HG00418.snps.hard_filt.vcf > header.vcf
grep "PASS" HG00418.snps.hard_filt.vcf | cat header.vcf - > HG00418.snps.hard_filt.pass.vcf

vcf-annotate --filter d=5 HG00418.snps.recal.pass.novel_filt.vcf > HG00418.snps.recal.pass.novel_filt.hard.vcf
grep "#" HG00418.snps.recal.pass.novel_filt.hard.vcf > header.vcf
grep "PASS" HG00418.snps.recal.pass.novel_filt.hard.vcf | cat header.vcf - > HG00418.snps.recal.pass.novel_filt.hard_pass.vcf

Now lets count the number of SNPs and lets see how many that overlap using bedtools (it works on vcf-files too!)

grep -c -v "#" HG00418.snps.hard_filt.pass.vcf
grep -c -v "#" HG00418.snps.recal.pass.novel_filt.hard_pass.vcf
bedtools intersect -a HG00418.snps.hard_filt.pass.vcf -b HG00418.snps.recal.pass.novel_filt.hard_pass.vcf | wc -l

Q7. How many SNPs did we get from the hard filtering, the soft filtering+depth5 and how many overlapped?

Heterozygotes vs Homozygotes

For a haploid chromosome, ie a bacterial genome there can only be two states - either there is an alternative allele or it is the reference base. We can classify a position as either homozygote reference or homozygote variant position. However for diploid chromosomes, such as the human chromosomes 1-22, there are the possibility to heterozygotes. Eg, one allele is an "A" and the other chromosome in the pair is a different base, eg "C". This is then a heterozygote. A position with the same base on both alleles is called a homozygote, either homozygote reference or homozygote variant. Whether a variant is called as a heterozygote or homozygote is encoded in the vcf file. Open the soft filtered vcf-file and look at the rightmost column. Here the first three characters encode the status, 0/0 means reference base on both alleles, 0/1 means one reference and one alternative allele (a heterozygote) and 1/1 meaning alternative allele for both chromosomes. Lets counts the number of heterozygote and homozygote variants.

less -S HG00418.snps.recal.pass.novel_filt.vcf
grep -c "0/1" HG00418.snps.recal.pass.novel_filt.vcf
grep -c "1/1" HG00418.snps.recal.pass.novel_filt.vcf

Q8. How many heterozygotes and homozygotes do we have?

If you have haploid chromosomes (bacteria, chrX and chrY in human males, mtDNA) it is important to remove all heterozygote variant calls, these cant be true and will arise from mapping and variant calling errors.

Open IGV and load select the Human (b37) genome in the top-left box. Then load the bam-file ("HG00418.sort.rmdup.realn.recal.bam"). Now we will also load in the vcf-file - chose Load from file and select "HG00418.snps.recal.filtered.vcf". You can also open the vcf file and try to find both heterozygote and homozygote SNP calls. This can be done by coping the position (eg. 9414773) and paste it into IGV search box, eg. "21:9414773".

java -XX:+UseParallelGC -XX:ParallelGCThreads=8  -jar -Xmx1250m /home/local/27626/bin/IGV/igv.jar &
less -S HG00418.snps.recal.pass.novel_filt.vcf

Comparison of UnifiedGenotyper and HaplotypeCaller (Optional)

If you are interested you can compare the calls from the UG vs. the HaplotypeCaller. Copy the haplotyper calls that I already made here and use bedtools intersect to do the comparison, first count the number of SNP calls made by the haplotyper (the grep command below). Then try to write the code for the comparison.

cp /home/local/27626/exercises/snp_calling/HG00418.haplotyper.snps.vcf* .
grep -v "#" -c HG00418.haplotyper.snps.vcf
bedtools intersect ...

Q9. What is the overlap between the UG and the HC? What does this tell you about the UG vs HC?

Analysis of SNP consequences (Optional)

So now we identified the SNPs - what consequence or effect do they have on the proteins of this individual? This is not included in this exercise, but if you are interested you can try it out in another exercise linked here. Scroll down to the section on "Identify cancer causing mutation" and exchange the "cancer1.snps.recal.pass.vcf" with one of your SNP files.

Samtools and bcftools (Optional)

An alternative and often more straightforward way of calling SNPs is to use samtools. Here we will try to call SNPs using samtools and bcftools.

samtools mpileup -uf /home/local/27626/exercises/snp_calling/chr21.fasta HG00418.sort.rmdup.realn.recal.bam | \
bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | varFilter > var.flt.vcf 

Open the vcf file using less and you should see that it also outputs a VCF-file and that indels are also called (Marked by "INDEL" in the beginning of the INFO field). Lets try to hard-filter this file as we did with the GAKT calls (it outputs a lot of warnings when you run it). Because vcf-annotate does not work correctly on samtools vcf-files when filtering quality we will use the perl-oneliner to do that first.

less -S var.flt.vcf 
grep "#" var.flt.vcf > header.vcf
perl -ne 'if ($_ =~ m/DP=(\d+)/) {if ($1 >= 5){print $_}}' var.flt.vcf | cat header.vcf - > var.flt.dp.vcf
vcf-annotate --filter c=2,5 var.flt.dp.vcf> var.flt.hard.vcf
grep "#" var.flt.hard.vcf > header.vcf
grep "PASS" var.flt.hard.vcf | grep -v "INDEL" | cat header.vcf - > var.flt.hard.pass.vcf

Lets compare the SNP calls between the two using bedtools.

grep -c -v "#" var.flt.hard.pass.vcf
grep -c -v "#" HG00418.snps.recal.pass.novel_filt.vcf 
bedtools intersect -a HG00418.snps.recal.pass.novel_filt.vcf -b var.flt.hard.pass.vcf | wc -l

Q10.How many SNPs were called by samtools (first number) and what is the overlap with the GATK snps (last number)?

Congratulations you finished the exercise!