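A long time ago, in my "live-streaming my genome" series, I mentioned this database: [Live stream] My Genome 67: the ClinVar database
Back then I did not know enough about genetics and never really understood it. Now that I have a chance to study it again, the following commands download the ClinVar VCF (with its index) and use it to annotate an existing VCF file: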
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20180429.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20180429.vcf.gz.tbi
java -jar ~/biosoft/SnpEff/snpEff/SnpSift.jar annotate clinvar_20180429.vcf.gz merge_snpeff.vcf >merge_clinvar.vcf
The annotation adds the following field descriptions to the VCF header:
##SnpSiftCmd="SnpSift annotate clinvar_20180429.vcf.gz merge_snpeff.vcf"
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting clinical significance for this single variant">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status for the Variation ID">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Clinical significance for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:clinical significance.">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
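Header lines like these are easier to scan as a table of field ID, type, and description. The following is a minimal Python sketch of one way to do that. The function name `parse_info_header` and the regular expression are my own assumptions for illustration and are not part of SnpSift or ClinVar; the two sample lines are copied from the header above.

```python
import re

# Assumed pattern for "##INFO=<ID=...,Number=...,Type=...,Description="...">" lines.
INFO_RE = re.compile(
    r'^##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">\s*$'
)


def parse_info_header(lines):
    """Return {field ID: {"number", "type", "description"}} for ##INFO lines."""
    fields = {}
    for line in lines:
        m = INFO_RE.match(line)
        if m:
            field_id, number, type_, description = m.groups()
            fields[field_id] = {
                "number": number,
                "type": type_,
                "description": description,
            }
    return fields


# Two header lines taken from the annotated VCF shown above.
header = [
    '##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">',
    '##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">',
]

fields = parse_info_header(header)
for field_id, meta in fields.items():
    print(field_id, meta["type"], meta["description"], sep="\t")
```

To run it on the real file, the same function could be fed the lines of `merge_clinvar.vcf` that start with `##INFO=`.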
The genes with the most records are:
zcat clinvar_20180429.vcf.gz|perl -alne '{/GENEINFO=(.*?):/;print $1 if $1}'|sort |uniq -c |sort -k 1,1nr >gene.clinvar.freq
head gene.clinvar.freq
9800 TTN
9038 BRCA2
6311 BRCA1
4987 ATM
4498 APC
3312 TSC2
3114 MSH6
2888 NF1
2653 MSH2
2223 LDLR
The most frequent combinations of gene and clinical significance (CLNSIG) are:
zcat clinvar_20180429.vcf.gz|perl -alne '{/GENEINFO=(.*?):/;$g=$1;/CLNSIG=(.*?);/;print "$g\t$1" if $1}'| sort |uniq -c |sort -k 1,1nr >gene_sign.clinvar.freq
head gene_sign.clinvar.freq
5218 TTN Uncertain_significance
3293 BRCA2 Uncertain_significance
2665 BRCA2 Pathogenic
2523 ATM Uncertain_significance
2162 BRCA1 Pathogenic
1897 APC Uncertain_significance
1887 TTN Likely_benign
1817 BRCA1 Uncertain_significance
1651 MSH6 Uncertain_significance
1465 BRCA2 Likely_benign
grep Pathogenic gene_sign.clinvar.freq|head
2665 BRCA2 Pathogenic
2162 BRCA1 Pathogenic
691 LDLR Pathogenic
594 MSH2 Pathogenic
571 MLH1 Pathogenic
544 COL4A5 Pathogenic
536 NF1 Pathogenic
469 DMD Pathogenic
467 MSH6 Pathogenic
466 APC Pathogenic
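The one-liner above pulls `GENEINFO` and `CLNSIG` out of each line with regular expressions, and the `CLNSIG=(.*?);` pattern only matches when another field follows `CLNSIG`. As an alternative, here is a small Python sketch that splits the INFO column into key/value pairs first, so it does not depend on field order. The helper names (`parse_info`, `count_gene_clnsig`) and the toy records are invented for illustration; they are not real ClinVar entries. Like the Perl version, it keeps only the first gene symbol in `GENEINFO`.

```python
from collections import Counter


def parse_info(info):
    """Split a VCF INFO string such as 'A=1;B=2' into a dict."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip flag-style entries that have no '='
            out[key] = value
    return out


def count_gene_clnsig(vcf_lines):
    """Count (first gene symbol, CLNSIG) pairs over VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        info = parse_info(cols[7])
        if "GENEINFO" not in info or "CLNSIG" not in info:
            continue  # skip records missing either field
        gene = info["GENEINFO"].split(":", 1)[0]
        counts[(gene, info["CLNSIG"])] += 1
    return counts


# Toy records (made up): CLNSIG last, CLNSIG first, and CLNSIG missing.
toy = [
    "##fileformat=VCFv4.1",
    "1\t100\t1\tA\tG\t.\t.\tALLELEID=1;GENEINFO=GENE_A:1|GENE_B:2;CLNSIG=Pathogenic",
    "1\t200\t2\tC\tT\t.\t.\tCLNSIG=Pathogenic;ALLELEID=2;GENEINFO=GENE_A:1",
    "1\t300\t3\tG\tA\t.\t.\tALLELEID=3;GENEINFO=GENE_B:2",
]

counts = count_gene_clnsig(toy)
for (gene, clnsig), n in counts.most_common():
    print(n, gene, clnsig, sep="\t")
```

For the real data, the lines could come from `gzip.open("clinvar_20180429.vcf.gz", "rt")` in place of the toy list.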
Next, check whether the records have been reviewed by experts:
zcat clinvar_20180429.vcf.gz|perl -alne '{/GENEINFO=(.*?):/;$g=$1;/CLNSIG=(.*?);/;$t=$1;/CLNREVSTAT=(.*?);/;print "$g\t$t\t$1" if $1}'| sort |uniq -c |sort -k 1,1nr >gene_sign_review.clinvar.freq
head gene_sign_review.clinvar.freq
4200 TTN Uncertain_significance criteria_provided,_single_submitter
2065 BRCA2 Pathogenic reviewed_by_expert_panel
1915 BRCA2 Uncertain_significance criteria_provided,_single_submitter
1666 BRCA1 Pathogenic reviewed_by_expert_panel
1639 TTN Likely_benign criteria_provided,_single_submitter
1597 ATM Uncertain_significance criteria_provided,_single_submitter
1251 APC Uncertain_significance criteria_provided,_single_submitter
1200 TTN Conflicting_interpretations_of_pathogenicity criteria_provided,_conflicting_interpretations
1174 TSC2 not_provided no_assertion_provided
grep BRCA2 gene_sign_review.clinvar.freq|grep Pathogenic
2065 BRCA2 Pathogenic reviewed_by_expert_panel
436 BRCA2 Pathogenic criteria_provided,_single_submitter
94 BRCA2 Pathogenic criteria_provided,_multiple_submitters,_no_conflicts
70 BRCA2 Pathogenic no_assertion_criteria_provided
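To put the BRCA2 numbers in proportion, the four rows shown above can be turned into shares per review status. This is a short Python sketch using only those four rows; the helper name `parse_freq_line` is my own and assumes the "count gene significance status" layout produced by `uniq -c`.

```python
def parse_freq_line(line):
    """Parse one 'count gene CLNSIG CLNREVSTAT' row from the .freq file."""
    count, gene, clnsig, revstat = line.split(None, 3)
    return int(count), gene, clnsig, revstat.strip()


# The four BRCA2 Pathogenic rows from the grep output above.
rows = """\
2065 BRCA2 Pathogenic reviewed_by_expert_panel
436 BRCA2 Pathogenic criteria_provided,_single_submitter
94 BRCA2 Pathogenic criteria_provided,_multiple_submitters,_no_conflicts
70 BRCA2 Pathogenic no_assertion_criteria_provided
""".splitlines()

parsed = [parse_freq_line(row) for row in rows]
total = sum(count for count, _gene, _sig, _status in parsed)
shares = {status: count / total for count, _gene, _sig, status in parsed}

print("total", total, sep="\t")
for status, share in shares.items():
    print(status, f"{share:.1%}", sep="\t")
```

The four counts add up to 2665, the same as the BRCA2 Pathogenic total in gene_sign.clinvar.freq, and roughly three quarters of those records (2065 of 2665) carry the reviewed_by_expert_panel status.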