这个一个肿瘤外显子项目的文章发表并且公布的公共数据,我这里给出全套分析流程代码。只需要你肯实践,就可以运行成功。
PS:有些后起之秀自己运营公众号或者博客喜欢批评我们这些老人,一味的堆砌代码不给解释,恶意揣测我们是因为不懂代码的原理。我表示很无语,我写了3千多篇教程,如果一篇篇都重复提到基础知识,我真的做不到。比如下面的流程,包括软件的用法,软件安装,注释数据库的下载,我博客都说过好几次了,直播我的基因组系列也详细解读过,我告诉你去哪里学,你却不珍惜,不当回事,呵呵。
文章解读
paper;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4812767/a whole-exome sequencing (WES) study was performed in 161 NPC cases and 895 controls of Southern Chinese descent
测序概述:
We sequenced the blood samples from 161 NPC cases, including 39 EAO cases, 63 FH+ cases from 52 independent families, and 59 sporadic cases by WES and achieved an average coverage of 49-fold on target (range of 32- to 76-fold)
后来还扩大了验证样本集:
An additional 2,160 NPC cases and 2,433 healthy controls from Hong Kong were further examined for the selected candidate variants.
数据全部上传了:
Whole-exome sequencing data for the early-age onset cases have been deposited in the Sequence Read Archive (SRA) database (accession ID SRA291701).
了解WES数据分析步骤:2016-A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data
选择部分数据如下:
随便选择了5个样本,可以在SRA数据库下载拿到这些信息(这个小技巧自行探索,或者看我B站教学视频)其ID号及描述如下,
SRX445405 MALE NPC15 SRR1139956 NPC15F NO SRS540548 NPC15F-T SRX445406 MALE NPC15 SRR1139958 NPC15F NO SRS540549 NPC15F-N SRX445407 MALE NPC29 SRR1139966 NPC29F YES SRS540550 NPC29F-T SRX445408 MALE NPC29 SRR1139973 NPC29F YES SRS540551 NPC29F-N SRX445409 FEMALE NPC10 SRR1139999 NPC10F NO SRS540552 NPC10F-T SRX445410 FEMALE NPC10 SRR1140007 NPC10F NO SRS540553 NPC10F-N SRX445411 FEMALE NPC34 SRR1140015 NPC34F NO SRS540554 NPC34F-T SRX445412 FEMALE NPC34 SRR1140023 NPC34F NO SRS540555 NPC34F-N SRX445413 MALE NPC37 SRR1140044 NPC37F YES SRS540556 NPC37F-T SRX445414 MALE NPC37 SRR1140045 NPC37F YES SRS540557 NPC37F-N
从SRA数据库下载并转换为fastq测序数据文件
把上面的描述文本存为文件npc.sra.txt下载,脚本如下:
cat npc.sra.txt | cut -f 4|while read id
do echo $id
wget -c ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP035/SRP035573/$id/$id.sra
done
上面的代码是3年前的,目前大家都用sratoolkit的prefectch命令来根据id号直接下载,并不需要自行拼凑下载链接了,现在学生信真的是越来越幸福了,一切从简,记得看我B站教学视频:https://space.bilibili.com/338686099/#/
转换脚本如下:
cat npc.sra.txt | while read id
do
array=($id)
echo ${array[3]}.sra ${array[7]}
~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --gzip --split-3 -A \
${array[7]} ${array[3]}.sra
done
得到的fastq文件如下:
3.5G Aug 26 09:48 NPC10F-N_1.fastq.gz 3.6G Aug 26 09:48 NPC10F-N_2.fastq.gz 3.2G Aug 26 09:48 NPC10F-T_1.fastq.gz 3.3G Aug 26 09:48 NPC10F-T_2.fastq.gz 2.7G Aug 26 09:47 NPC15F-N_1.fastq.gz 2.8G Aug 26 09:47 NPC15F-N_2.fastq.gz 2.8G Aug 26 09:47 NPC15F-T_1.fastq.gz 2.9G Aug 26 09:47 NPC15F-T_2.fastq.gz 2.8G Aug 26 09:47 NPC29F-N_1.fastq.gz 2.9G Aug 26 09:47 NPC29F-N_2.fastq.gz 2.4G Aug 26 09:47 NPC29F-T_1.fastq.gz 2.5G Aug 26 09:47 NPC29F-T_2.fastq.gz 2.0G Aug 26 09:47 NPC34F-N_1.fastq.gz 2.0G Aug 26 09:47 NPC34F-N_2.fastq.gz 2.2G Aug 26 09:48 NPC34F-T_1.fastq.gz 2.3G Aug 26 09:48 NPC34F-T_2.fastq.gz 2.1G Aug 26 09:47 NPC37F-N_1.fastq.gz 2.1G Aug 26 09:47 NPC37F-N_2.fastq.gz 2.2G Aug 26 09:46 NPC37F-T_1.fastq.gz 2.2G Aug 26 09:46 NPC37F-T_2.fastq.gz
简单的走一下fastqc+multiqc 看看数据质量,一般都会很不错的,这个数据也不例外。
ls *.gz |xargs ~/biosoft/fastqc/FastQC/fastqc -o ./ -t 5
但是值得注意的是本文章中的测序reads的编码格式是phred+64 而不是传统的33
然后走WES的标准SNP-calling流程
选用的是经典的GATK best practice的流程,代码如下:
module load java/1.8.0_91
GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta
INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37
GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar
PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar
DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz
SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz
INDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz
Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf
TMPDIR=/home/jianmingzeng/tmp/software
## samtools and bwa are in the environment
## samtools Version: 1.3.1 (using htslib 1.3.1)
## bwa Version: 0.7.15-r1140
#arr=($1)
#fq1=${arr[0]}
#fq2=${arr[1]}
#sample=${arr[2]}
fq1=$1
fq2=$2
sample=$3
#####################################################
################ Step 1 : Alignment #################
#####################################################
start=$(date +%s.%N)
echo bwa `date`
bwa mem -t 5 -M -R "@RG\tID:$sample\tSM:$sample\tLB:WES\tPL:Illumina" $INDEX $fq1 $fq2 1>$sample.sam 2>/dev/null
echo bwa `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for BWA : %.6f seconds" $dur
echo
#####################################################
################ Step 2: Sort and Index #############
#####################################################
start=$(date +%s.%N)
echo SortSam `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bam
samtools index $sample.bam
echo SortSam `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for SortSam : %.6f seconds" $dur
echo
rm $sample.sam
#####################################################
################ Step 3: Basic Statistics ###########
#####################################################
start=$(date +%s.%N)
echo stats `date`
samtools flagstat $sample.bam > ${sample}.alignment.flagstat
samtools stats $sample.bam > ${sample}.alignment.stat
echo plot-bamstats -p ${sample}_QC ${sample}.alignment.stat
echo stats `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for Basic Statistics : %.6f seconds" $dur
echo
#####################################################
####### Step 4: multiple filtering for bam files ####
#####################################################
###MarkDuplicates###
start=$(date +%s.%N)
echo MarkDuplicates `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD MarkDuplicates \
INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metrics
echo MarkDuplicates `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for MarkDuplicates : %.6f seconds" $dur
echo
rm $sample.bam
###FixMateInfo###
start=$(date +%s.%N)
echo FixMateInfo `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD FixMateInformation \
INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinate
samtools index ${sample}_marked_fixed.bam
echo FixMateInfo `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for FixMateInfo : %.6f seconds" $dur
echo
rm ${sample}_marked.bam
#####################################################
####### Step 5: gatk process bam files ##############
#####################################################
### SplitNCigar ###
start=$(date +%s.%N)
echo SplitNCigar `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SplitNCigarReads \
-R $GENOME -I ${sample}_marked_fixed.bam -o ${sample}_marked_fixed_split.bam \
-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
#--fix_misencoded_quality_scores
## --fix_misencoded_quality_scores only if phred 64
echo SplitNCigar `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for SplitNCigar : %.6f seconds" $dur
echo
rm ${sample}_marked_fixed.bam
# rm ${sample}.sam ${sample}.bam ${sample}_marked.bam ${sample}_marked_fixed.bam
###RealignerTargetCreator###
start=$(date +%s.%N)
echo RealignerTargetCreator `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T RealignerTargetCreator \
-I ${sample}_marked_fixed_split.bam -R $GENOME -o ${sample}_target.intervals \
-known $Mills_indels -known $KG_indels -nt 5
echo RealignerTargetCreator `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for RealignerTargetCreator : %.6f seconds" $dur
echo
###IndelRealigner###
start=$(date +%s.%N)
echo IndelRealigner `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T IndelRealigner \
-I ${sample}_marked_fixed_split.bam -R $GENOME -targetIntervals ${sample}_target.intervals \
-o ${sample}_realigned.bam -known $Mills_indels -known $KG_indels
echo IndelRealigner `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for IndelRealigner : %.6f seconds" $dur
echo
rm ${sample}_marked_fixed_split.bam
###BaseRecalibrator###
start=$(date +%s.%N)
echo BaseRecalibrator `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T BaseRecalibrator \
-I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNP
echo BaseRecalibrator `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for BaseRecalibrator : %.6f seconds" $dur
echo
###PrintReads###
start=$(date +%s.%N)
echo PrintReads `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T PrintReads \
-R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.table
samtools index ${sample}_recal.bam
echo PrintReads `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for PrintReads : %.6f seconds" $dur
echo
rm ${sample}_realigned.bam
chmod uga=r ${sample}_recal.bam
#####################################################
############## Step 6: gatk call snp/indel##########
#####################################################
###
start=$(date +%s.%N)
echo HaplotypeCaller `date`
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T HaplotypeCaller \
-R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP \
-stand_emit_conf 10 -o ${sample}_raw.snps.indels.vcf
echo HaplotypeCaller `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for HaplotypeCaller : %.6f seconds" $dur
echo
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \
-selectType SNP \
-V ${sample}_raw.snps.indels.vcf -o ${sample}_raw_snps.vcf
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \
-selectType INDEL \
-V ${sample}_raw.snps.indels.vcf -o ${sample}_raw_indels.vcf
##
:'
'
## for SNP
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-V ${sample}_raw_snps.vcf -o ${sample}_filtered_snps.vcf
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \
--excludeFiltered \
-V ${sample}_filtered_snps.vcf -o ${sample}_filtered_PASS.snps.vcf
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantEval -R $GENOME \
-eval ${sample}_filtered_PASS.snps.vcf -o ${sample}_filtered_PASS.snps.vcf.eval
## for INDEL
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-V ${sample}_raw_indels.vcf -o ${sample}_filtered_indels.vcf
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \
--excludeFiltered \
-V ${sample}_filtered_indels.vcf -o ${sample}_filtered_PASS.indels.vcf
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantEval -R $GENOME \
-eval ${sample}_filtered_PASS.indels.vcf -o ${sample}_filtered_PASS.indels.vcf.eval
:'
'
## bam文件的QC
bedtools coverage -hist -abam ${sample}.bam \
-b ~/annotation/CCDS/human/hg19_exon.bed >${sample}.exome.coverage.hist.txt
## 后面的几个质控,代码是我自己写的,所以大家可以不运行。
perl ~/scripts/calc_coverage_depth.pl ~/annotation/CCDS/human/CCDS.20110907.txt ${sample}.bam
比对成功后得到的bam文件如下;
7.7G Aug 26 19:22 NPC10F-N_marked_fixed.bam 7.7G Aug 26 22:59 NPC10F-N_marked_fixed_split.bam 7.7G Aug 26 23:57 NPC10F-N_realigned.bam 15G Aug 27 03:45 NPC10F-N_recal.bam 7.0G Aug 26 19:49 NPC10F-T_marked_fixed.bam 7.0G Aug 26 22:55 NPC10F-T_marked_fixed_split.bam 7.0G Aug 26 23:48 NPC10F-T_realigned.bam 13G Aug 27 03:12 NPC10F-T_recal.bam
Snp-calling结束后得到的vcf如下:
82576 NPC10F-N_raw_indels.vcf 863243 NPC10F-N_raw_snps.vcf 945753 NPC10F-N_recal_raw.snps.indels.vcf 89604 NPC10F-T_raw_indels.vcf 819657 NPC10F-T_raw_snps.vcf 909190 NPC10F-T_recal_raw.snps.indels.vcf
这些vcf里面的变异位点还需要进行简单的过滤,或者只提取外显子区域的变异位点。
消耗时间如下;
# for NPC10F-N_1.fastq.gz NPC10F-N_2.fastq.gz NPC10F-N bwa Sat Aug 26 16:04:44 CST 2017 bwa Sat Aug 26 17:07:11 CST 2017 SortSam Sat Aug 26 17:07:11 CST 2017 SortSam Sat Aug 26 17:45:04 CST 2017 stats Sat Aug 26 17:45:04 CST 2017 plot-bamstats -p NPC10F-N_QC NPC10F-N.alignment.stat stats Sat Aug 26 17:56:07 CST 2017 MarkDuplicates Sat Aug 26 17:56:07 CST 2017 MarkDuplicates Sat Aug 26 18:40:25 CST 2017 FixMateInfo Sat Aug 26 18:40:25 CST 2017 FixMateInfo Sat Aug 26 19:26:39 CST 2017 SplitNCigar Sat Aug 26 22:20:48 CST 2017 SplitNCigar Sat Aug 26 22:59:32 CST 2017 RealignerTargetCreator Sat Aug 26 22:59:32 CST 2017 RealignerTargetCreator Sat Aug 26 23:17:35 CST 2017 IndelRealigner Sat Aug 26 23:17:35 CST 2017 IndelRealigner Sat Aug 26 23:57:35 CST 2017 BaseRecalibrator Sat Aug 26 23:57:35 CST 2017 BaseRecalibrator Sun Aug 27 01:27:39 CST 2017 PrintReads Sun Aug 27 01:27:39 CST 2017 PrintReads Sun Aug 27 03:52:03 CST 2017 #for NPC10F-T_1.fastq.gz NPC10F-T_2.fastq.gz NPC10F-T bwa Sat Aug 26 16:54:14 CST 2017 bwa Sat Aug 26 17:48:07 CST 2017 SortSam Sat Aug 26 17:48:07 CST 2017 SortSam Sat Aug 26 18:20:48 CST 2017 stats Sat Aug 26 18:20:48 CST 2017 plot-bamstats -p NPC10F-T_QC NPC10F-T.alignment.stat stats Sat Aug 26 18:30:45 CST 2017 MarkDuplicates Sat Aug 26 18:30:45 CST 2017 MarkDuplicates Sat Aug 26 19:11:40 CST 2017 FixMateInfo Sat Aug 26 19:11:40 CST 2017 FixMateInfo Sat Aug 26 19:52:37 CST 2017 SplitNCigar Sat Aug 26 22:20:54 CST 2017 SplitNCigar Sat Aug 26 22:55:53 CST 2017 RealignerTargetCreator Sat Aug 26 22:55:53 CST 2017 RealignerTargetCreator Sat Aug 26 23:14:19 CST 2017 IndelRealigner Sat Aug 26 23:14:19 CST 2017 IndelRealigner Sat Aug 26 23:48:43 CST 2017 BaseRecalibrator Sat Aug 26 23:48:43 CST 2017 BaseRecalibrator Sun Aug 27 01:10:26 CST 2017 PrintReads Sun Aug 27 01:10:26 CST 2017 PrintReads Sun Aug 27 03:18:33 CST 2017
外显子的QC结果是:
## for NPC10F-N 32541594 2392028455 0.982188933530586 73.5068004044301 18711840 934414537 0.971564415370185 49.9370739061471 17075251 505674931 0.895198983425405 29.6144947444696 15656543 241509710 0.819070615186704 15.4254812189383 ## for NPC10F-T 32348763 2946257280 0.97636879840624 91.0778962398037 18182204 1054428864 0.94406442121146 57.9923569221861 16075260 555403212 0.842772759844003 34.5501853158207 14181528 265599059 0.741905340358179 18.7285219900141
可以看到T的测序深度高于N,符合实验设计。T的外显子区域平均测序深度高达91X,是比较好的数据了,而且外显子旁边50bp范围区域平均测序深度还有57X,旁边100bp区域还有34X,也说明这款芯片的捕获效率比较好
然后走走somatic mutation calling流程
因为是配对数据,还可以走somatic mutation calling流程
reference=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta
GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta
TMPDIR=/home/jianmingzeng/tmp/software
normal_bam=NPC10F-N_recal.bam
tumor_bam=NPC10F-T_recal.bam
sample=NPC10F
#####################################################
################### Step : Run VarScan #############
#####################################################
normal_pileup="samtools mpileup -q 1 -f $reference $normal_bam";
tumor_pileup="samtools mpileup -q 1 -f $reference $tumor_bam";
# Next, issue a system call that pipes input from these commands into VarScan :
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar \
somatic <($normal_pileup) <($tumor_pileup) $sample
java -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar processSomatic $sample.snp
#####################################################
################### Step : Run Mutect2 #############
#####################################################
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T MuTect2 \
-R $GENOME -I:tumor $tumor_bam -I:normal $normal_bam \
--dbsnp $DBSNP -o ${sample}-mutect2.vcf
#####################################################
################### Step : Run Muse#################
#####################################################
~/biosoft/muse/muse call -O $sample -f $reference $tumor_bam $normal_bam
~/biosoft/muse/muse sump -I ${sample}.MuSE.txt -E –O ${sample}.vcf –D $DBSNP
其中varscan耗费两个半小时,结果如下:
893539210 positions in tumor 891822458 positions shared in normal 79827518 had sufficient coverage for comparison 79718238 were called Reference 26 were mixed SNP-indel calls and filtered 102492 were called Germline 3703 were called LOH 2569 were called Somatic 490 were called Unknown 0 were called Variant Reading input from NPC10F.snp Opening output files: NPC10F.snp.Somatic NPC10F.snp.Germline NPC10F.snp.LOH 101352 VarScan calls processed 2342 were Somatic (556 high confidence) 95144 were Germline (4830 high confidence) 3502 were LOH (3484 high confidence)
这3个软件找到的somatic mutation可以互相对比一下,差异主要是在哪里,都值得自己去探究,这样最终确定的肿瘤外显子测序数据分析流程就更有把握。
需要安装的软件
软件太多了,我就不一一列出具体代码了,还有很多需要下载的参考基因组,变异数据库也是以前直播基因组的时候已经反复提到过,也不赘述啦。
conda install -c bioconda bedtools
conda install -c bioconda bwa
conda install -c bioconda samtools
cd ~/biosoft
mkdir sratoolkit && cd sratoolkit
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz
tar zxvf sratoolkit.current-centos_linux64.tar.gz
~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastdump -h
## https://sourceforge.net/projects/picard/
## https://github.com/broadinstitute/picard
cd ~/biosoft
mkdir picardtools && cd picardtools
wget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zip
unzip picard-tools-1.119.zip
mkdir 2.9.2 && cd 2.9.2
wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jar
cd ~/biosoft
## https://sourceforge.net/projects/varscan/files/
## http://varscan.sourceforge.net/index.html
mkdir VarScan && cd VarScan
wget https://sourceforge.net/projects/varscan/files/VarScan.v2.3.9.jar
cd ~/biosoft
mkdir SnpEff && cd SnpEff
## http://snpeff.sourceforge.net/
## http://snpeff.sourceforge.net/SnpSift.html
## http://snpeff.sourceforge.net/SnpEff_manual.html
wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
## java -jar snpEff.jar download GRCh37.75
## java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.75 example.vcf > example_snpeff.vcf
unzip snpEff_latest_core.zip