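Whole-genome sequencing of a single person typically yields about four million SNVs (single-base positions that differ from the reference genome) and about five hundred thousand indels (insertions or deletions). The results are usually delivered as a VCF file (look up the VCF format if it is new to you).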
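To make the SNV/indel distinction concrete, here is a minimal sketch that tallies both from an uncompressed VCF by comparing the lengths of REF (column 4) and ALT (column 5). The function name `count_variant_types` is made up for illustration, and the classification is deliberately simple: symbolic or missing ALT alleles such as `<DEL>`, `*` or `.` are counted as "other".

```shell
# count_variant_types: hypothetical helper, not part of any toolkit.
# Usage: count_variant_types sample.vcf
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                              # skip meta-information and header lines
    {
      n = split($5, alts, ",")                 # ALT may list several alleles
      for (k = 1; k <= n; k++) {
        if (alts[k] !~ /^[ACGTNacgtn]+$/)      other++   # symbolic or missing allele
        else if (length($4) == 1 && length(alts[k]) == 1) snv++
        else if (length($4) != length(alts[k])) indel++
        else                                   other++   # e.g. multi-base substitutions
      }
    }
    END { printf "SNV\t%d\nindel\t%d\nother\t%d\n", snv, indel, other }
  ' "$1"
}
```

For a compressed file, decompress it first (for example with `gzip -dc`) before passing it to the function.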
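TCGA is one of the largest databases of cancer genomics data. Its collection of somatic mutations is very important: it gathers the somatic mutations reported for each cancer type in the TCGA project. If the variant file from our own sample overlaps with it, that is a warning sign. Download it with the code below.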
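# Column 2 of the manifest holds dot-separated file names
for i in $(cut -f 2 GDC_open_MAFs_manifest.txt)
do
  echo "$i"
  # Field 4 of the name is the file ID used in the download URL
  address=$(echo "$i" | cut -d'.' -f 4)
  # Drop the ID (field 4) from the name to build the local file name
  filename=$(echo "$i" | cut -d'.' -f 1-3,5-7)
  echo "$address" "$filename"
  wget -O "$filename" "https://gdc-api.nci.nih.gov/data/$address"
done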
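Once a MAF file is downloaded, the overlap check described above can be sketched as follows. This is only an illustration under several assumptions: both files are uncompressed and on the same genome build, the MAF header uses the column names `Chromosome`, `Start_Position`, `Reference_Allele` and `Tumor_Seq_Allele2`, and only SNVs are compared, because VCF and MAF represent indels differently. The function name `maf_snv_overlap` is hypothetical.

```shell
# maf_snv_overlap: hypothetical helper that prints sample SNVs also present in a MAF.
# Usage: maf_snv_overlap sample.vcf cohort.maf
maf_snv_overlap() {
  awk -F'\t' '
    # Build a chrom:pos:ref:alt key, ignoring a leading "chr" so "chr17" matches "17"
    function key(c, p, r, a) { sub(/^chr/, "", c); return c ":" p ":" r ":" a }
    /^#/ { next }                               # comment lines in either file
    !in_vcf && !have_header {                   # first non-comment MAF line is the header
      for (i = 1; i <= NF; i++) col[$i] = i     # map column name to column index
      have_header = 1; next
    }
    !in_vcf {                                   # MAF records: remember each mutation
      seen[key($col["Chromosome"], $col["Start_Position"], $col["Reference_Allele"], $col["Tumor_Seq_Allele2"])] = 1
      next
    }
    # VCF records: print single-base variants that were seen in the MAF
    length($4) == 1 && length($5) == 1 && (key($1, $2, $4, $5) in seen)
  ' "$2" in_vcf=1 "$1"
}
```

The `in_vcf=1` operand between the two file names is a standard awk variable assignment that takes effect once the first file has been read, which is how the script tells the MAF apart from the VCF.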
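Some databases require registration, so no direct download link can be given here. COSMIC is one example: it is also a cancer database, and we likewise do not want to see its mutations in healthy individuals. The attached screenshot shows the registration page.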
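For databases built from healthy individuals, such as the 1000 Genomes Project, we do the opposite: any variant in our sample that also appears in the database is filtered out and not studied further, because it is unremarkable for a healthy person to carry it (although this is not absolute).
Download the 1000 Genomes database: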
mkdir -p ~/annotation/variation/human/1000genomes
cd ~/annotation/variation/human/1000genomes
## ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
nohup wget -c -r -nd -np -k -L -p ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502 &
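The filtering step described above can be sketched in a few lines of awk. This is a simplified illustration, not a production filter: it assumes both VCFs are uncompressed, use the same chromosome naming and genome build, and that the sample VCF has one ALT allele per record. Real pipelines usually rely on dedicated annotation tools and allele-frequency thresholds rather than simple presence or absence. The function name `drop_known_variants` is hypothetical.

```shell
# drop_known_variants: hypothetical helper that prints sample records whose
# CHROM:POS:REF:ALT does not occur in a population "sites" VCF.
# Usage: drop_known_variants sample.vcf population.sites.vcf
drop_known_variants() {
  awk -F'\t' '
    /^#/ { if (in_sample) print; next }     # keep the sample header, skip the database header
    !in_sample {                            # database records: index every ALT allele
      n = split($5, alts, ",")
      for (k = 1; k <= n; k++) known[$1 ":" $2 ":" $4 ":" alts[k]] = 1
      next
    }
    !(($1 ":" $2 ":" $4 ":" $5) in known)   # sample records: print only unseen variants
  ' "$2" in_sample=1 "$1"
}
```

Holding a whole population database in an awk array needs a lot of memory, so in practice it is worth running this per chromosome.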
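There are several other commonly used databases that I will not introduce one by one (the lines starting with # give each database's home page and download location, so you can look them up yourself).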
mkdir -p ~/annotation/variation/human/ExAC
cd ~/annotation/variation/human/ExAC
## http://exac.broadinstitute.org/
## ftp://ftp.broadinstitute.org/pub/ExAC_release/current
wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/ExAC.r0.3.1.sites.vep.vcf.gz.tbi
nohup wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/ExAC.r0.3.1.sites.vep.vcf.gz &
wget ftp://ftp.broadinstitute.org/pub/ExAC_release/current/cnv/exac-final-cnv.gene.scores071316
mkdir -p ~/annotation/variation/human/dbSNP
cd ~/annotation/variation/human/dbSNP
## https://www.ncbi.nlm.nih.gov/projects/SNP/
## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/
## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/
nohup wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz &
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz.tbi
mkdir -p ~/annotation/variation/human/ESP6500
cd ~/annotation/variation/human/ESP6500
# http://evs.gs.washington.edu/EVS/
nohup wget http://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz &
mkdir -p ~/annotation/variation/human/UK10K
cd ~/annotation/variation/human/UK10K
# http://www.uk10k.org/
nohup wget ftp://ngs.sanger.ac.uk/production/uk10k/UK10K_COHORT/REL-2012-06-02/UK10K_COHORT.20160215.sites.vcf.gz &
mkdir -p ~/annotation/variation/human/gonl
cd ~/annotation/variation/human/gonl
## http://www.nlgenome.nl/search/
## https://molgenis26.target.rug.nl/downloads/gonl_public/variants/release5/
nohup wget -c -r -nd -np -k -L -p https://molgenis26.target.rug.nl/downloads/gonl_public/variants/release5 &
## The Affymetrix Genome-Wide Human SNP Array 6.0 && The Illumina Human1M single BeadChip
## http://www.statgen.nus.edu.sg/~SGVP/
## http://www.statgen.nus.edu.sg/~SGVP/singhap/files-website/samples-information.txt
# http://www.statgen.nus.edu.sg/~SGVP/singhap/files-website/genotypes/2009-01-30/QC/
## Singapore Sequencing Malay Project (SSMP)
mkdir -p ~/annotation/variation/human/SSMP
cd ~/annotation/variation/human/SSMP
## http://www.statgen.nus.edu.sg/~SSMP/
## http://www.statgen.nus.edu.sg/~SSMP/download/vcf/2012_05
## Singapore Sequencing Indian Project (SSIP)
mkdir -p ~/annotation/variation/human/SSIP
cd ~/annotation/variation/human/SSIP
# http://www.statgen.nus.edu.sg/~SSIP/
## http://www.statgen.nus.edu.sg/~SSIP/download/vcf/dataFreeze_Feb2013
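Because most of these downloads are large and run in the background under `nohup`, a transfer can be cut short without anyone noticing. The sketch below runs `gzip -t` on every `.gz` file under a directory and reports which ones pass the integrity test. The function name `check_gz_files` is made up for illustration.

```shell
# check_gz_files: hypothetical helper that tests every .gz file under a directory.
# Usage: check_gz_files ~/annotation/variation/human
check_gz_files() {
  find "$1" -type f -name '*.gz' | while IFS= read -r f; do
    if gzip -t "$f" 2>/dev/null; then
      echo "OK $f"
    else
      echo "BAD $f"        # truncated or corrupted: download it again
    fi
  done
}
```

This only checks that the compressed stream is intact; it does not confirm that the file matches what the server published.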