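Nov 11

Batch database annotation should not be done blindly: an error in the annovar database

Posted on November 11, 2016 by ulwvfje

I made two mutant cell lines for the H3F3A gene, G34V and K27M. The coordinates of this gene on hg38 are: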

Genomic Location for H3F3A Gene
Chromosome: 1
Start: 226,061,851 bp from pter; End: 226,072,002 bp from pter
Size: 10,152 bases; Orientation: Plus strand

Then I used samtools together with bcftools to call the SNP sites in this gene region:

samtools mpileup -r chr1:226061851-226072001 -t "DP4" -ugf ~/reference/genome/hg38/hg38.fa *sorted.bam | bcftools call -vmO z -o H3F3A.vcf.gz
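
As a small sanity check on the region passed to `-r`, the sketch below builds the region string from the coordinates quoted above and confirms the stated size. It assumes the 1-based, inclusive coordinates that samtools uses for regions; the helper names are hypothetical. Note that the command above ends the region at 226072001, one base short of the listed gene end.

```python
def region_string(chrom, start, end):
    """Format a 1-based, inclusive interval as 'chrom:start-end'."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{chrom}:{start}-{end}"


def region_size(start, end):
    """Number of bases covered by a 1-based, inclusive interval."""
    return end - start + 1


# Coordinates quoted above for H3F3A on hg38.
start, end = 226_061_851, 226_072_002
print(region_string("chr1", start, end))
print(region_size(start, end))
```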


Nov 01

WES (6): Annotating with annovar

Posted on November 1, 2015 by ulwvfje

For how to use annovar, see: http://www.bio-info-trainee.com/?p=641

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4 Sample3.varscan.snp.vcf > Sample3.annovar

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4 Sample4.varscan.snp.vcf > Sample4.annovar

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4 Sample5.varscan.snp.vcf > Sample5.annovar

Then annotate the samples in batch with the script below

Reading gene annotation from /home/jmzeng/bio-soft/annovar/humandb/hg19_refGene.txt ... Done with 50914 transcripts (including 11516 without coding sequence annotation) for 26271 unique genes
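
The batch script itself is not reproduced in this post. A minimal sketch of what such a batch run might look like follows, assuming the same `annotate_variation.pl` flags used in the "Annovar usage notes" post and the install path from the commands above. The function name and the output prefix `<sample>.anno` are assumptions; the sketch only prints the commands rather than running them.

```python
import shlex

# Install path taken from the convert2annovar.pl commands above.
ANNOVAR_DIR = "/home/jmzeng/bio-soft/annovar"


def annotate_command(sample, buildver="hg19"):
    """Build the gene-based annotation command for one sample (hypothetical helper)."""
    return [
        f"{ANNOVAR_DIR}/annotate_variation.pl",
        "--buildver", buildver,
        "--geneanno",
        "--outfile", f"{sample}.anno",   # assumed output prefix
        f"{sample}.annovar",             # file produced by convert2annovar.pl
        f"{ANNOVAR_DIR}/humandb",
    ]


if __name__ == "__main__":
    for sample in ("Sample3", "Sample4", "Sample5"):
        # Print only; pass the list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(annotate_command(sample)))
```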

Looking at the results, not many of the mutations actually fall in exons

23515 Sample3.anno.exonic_variant_function

23913 Sample4.anno.exonic_variant_function

24009 Sample5.anno.exonic_variant_function

What annovar does is classify the 100,000-plus SNPs we obtained: which part of a gene each SNP falls in, and whether it changes the protein

downstream

exonic

exonic;splicing

intergenic

intronic

ncRNA_exonic

ncRNA_intronic

ncRNA_splicing

ncRNA_UTR3

ncRNA_UTR5

splicing

upstream

upstream;downstream

UTR3

UTR5

UTR5;UTR3
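
To see how the SNPs are distributed over these categories, the first column of the `.variant_function` file can be tallied. The sketch below assumes a tab-separated file whose first field is the category; the function name is hypothetical and the example rows are made up for illustration only.

```python
from collections import Counter
import io


def count_regions(lines):
    """Count the first tab-separated field (the region category) of each line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts


# Made-up rows for illustration; with real data use something like
#   with open("Sample3.anno.variant_function") as fh: count_regions(fh)
example = io.StringIO(
    "exonic\tSAMD11\tchr1\t874467\t874467\tG\tA\n"
    "intronic\tNOC2L\tchr1\t888000\t888000\tA\tG\n"
    "exonic\tNOC2L\tchr1\t888639\t888639\tA\tG\n"
)
for region, n in count_regions(example).most_common():
    print(region, n)
```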

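Apr 16

Annovar usage notes

Posted on April 16, 2015 by ulwvfje

For how to install the software, see the previous tutorial

1. First, convert the VCF file from the SNP-calling step into the format annovar requires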

convert2annovar.pl -format vcf4 12.vcf >12.annovar

2. Run the annotation

There are quite a few command-line arguments, so it is easier to run it from a script

# define path
infolder=/home/jmzeng/hoston/diff
outfolder=$infolder
annovardb=/home/jmzeng/bio-soft/annovar/humandb

# start annotating
/home/jmzeng/bio-soft/annovar/annotate_variation.pl \
    --buildver hg19 \
    --geneanno \
    --outfile ${outfolder}/12.anno \
    ${infolder}/12.annovar \
    ${annovardb}

3. Interpreting the output

2.6M Apr 14 22:32 12.anno.exonic_variant_function

1.9K Apr 14 22:32 12.anno.log

1.3M Apr 14 22:32 12.anno.variant_function

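The file to focus on is the one with the exonic_variant_function suffix, which annotates every exonic variant from the VCF (the variant_function file covers all variants).

This result can then be parsed to find the mutations of interest for your experimental design.

Columns 5 and 6 are the chromosome and position

Column 4 is quite complex; it holds the annotation of the mutation

Column 12 is the sequencing depth, which should generally be greater than 20

Here I first convert the annotation file into the following format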

location:chr1:874467 SAMD11:NM_152486:exon6:c.G478A:p.D160N

location:chr1:888639 NOC2L:NM_015658:exon9:c.A918G:p.E306E

location:chr1:888659 NOC2L:NM_015658:exon9:c.A898G:p.I300V

location:chr1:916549 PERM1:NM_001291367:exon2:c.T58C:p.W20R

location:chr1:949608 ISG15:NM_005101:exon2:c.G248A:p.S83N

location:chr1:980552 AGRN:NM_198576:exon13:c.G2266A:p.A756T

location:chr1:1114699 TTLL10:NM_001130045:exon4:c.G104A:p.R35Q

location:chr1:1158631 SDF4:NM_016176:exon4:c.T570C:p.D190D

location:chr1:1158631 SDF4:NM_016547:exon4:c.T570C:p.D190D

location:chr1:1164073 SDF4:NM_016176:exon2:c.C101T:p.A34V

Then I compare two such files and take the mutations that differ for formatted output.
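
The comparison step is not shown in the post. A minimal sketch of one way to do it, assuming both files use exactly the `location:chr:pos gene:transcript:exon:cDNA:protein` layout listed above; the function names are hypothetical and the two sample lists just reuse lines from that listing.

```python
def parse_line(line):
    """Split one formatted line into (chrom, pos, gene, transcript, exon, cdna, protein)."""
    location, annotation = line.split()
    _, chrom, pos = location.split(":")
    gene, transcript, exon, cdna, protein = annotation.split(":")
    return (chrom, int(pos), gene, transcript, exon, cdna, protein)


def only_in_first(lines_a, lines_b):
    """Mutations present in the first file but not in the second, in input order."""
    seen_b = {parse_line(l) for l in lines_b if l.strip()}
    parsed_a = (parse_line(l) for l in lines_a if l.strip())
    return [m for m in parsed_a if m not in seen_b]


# Lines reused from the listing above; the split into two samples is illustrative.
sample_a = [
    "location:chr1:874467 SAMD11:NM_152486:exon6:c.G478A:p.D160N",
    "location:chr1:888639 NOC2L:NM_015658:exon9:c.A918G:p.E306E",
]
sample_b = [
    "location:chr1:888639 NOC2L:NM_015658:exon9:c.A918G:p.E306E",
]
for mutation in only_in_first(sample_a, sample_b):
    print(mutation)
```

Keying on the whole tuple means the same position annotated on two transcripts (as with SDF4 above) is treated as two separate entries.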

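Apr 10

Annotating SNPs and formatting the output

Posted on April 10, 2015 by ulwvfje

I have already covered how to annotate VCF-format SNPs with annovar. After annotation the result looks roughly like this: the coordinates of each SNP site and the gene it lies in are clearly labeled. It is also clear which exon of which transcript of which gene carries the mutation and, even better, which base at which position changed to which base, along with the corresponding amino acid change.

But this is not very convenient for browsing the details of each mutation, so I wrote a script to format them.

Ideally the output should look like the picture above: the 12 amino acids on each side of the mutated residue are shown, and the mutated one is highlighted in red. Color control is a hassle, though, so I did not do it. The result looks like this

Three things are needed to produce this formatted output. First, the RefSeq IDs beginning with NM must be converted to Ensembl transcript IDs. Second, the CDS sequence of each transcript must be found. Both can be obtained from BioMart or with a script of your own. Third, a script then pulls the pieces together.

The code is as follows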

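[perl]
use strict;
use warnings;

# Standard genetic code; codons are enumerated with bases in T, C, A, G order.
# (The original script used %codon without defining it.)
my @bases = qw(T C A G);
my $aa = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon;
my $k = 0;
for my $x (@bases) { for my $y (@bases) { for my $z (@bases) {
    $codon{"$x$y$z"} = substr($aa, $k++, 1);
} } }

# RefSeq (NM_) ID in field 5 -> Ensembl transcript ID in field 2.
my %hash_nm_enst;
open my $fh1, '<', 'NM2ensembl.txt' or die "NM2ensembl.txt: $!";
while (<$fh1>) {
    chomp;
    my @F = split;
    $hash_nm_enst{$F[4]} = $F[1] if $F[4];
}
close $fh1;

# Read a FASTA file into a hash keyed by the second '|'-separated header field.
sub read_fasta {
    my ($file) = @_;
    my (%seq, $key);
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ />/) { $key = (split /\|/, $line)[1]; }
        elsif (defined $key) { $seq{$key} .= $line; }
    }
    close $fh;
    return %seq;
}
my %hash_nucl = read_fasta('ENST.CDS.fa');
my %hash_prot = read_fasta('ENST.protein');    # loaded as in the original, not used below

# Translate codon by codon, padding each residue with a space on either side.
sub translate {
    my ($nucl) = @_;
    my $out = '';
    for (my $j = 0; $j < length($nucl) - 2; $j += 3) {
        $out .= ' ' . ($codon{ uc substr($nucl, $j, 3) } // '?') . ' ';
    }
    return $out;
}

open my $fh4, '<', 'raw.mutiple.txt' or die "raw.mutiple.txt: $!";
my $i = 1;
while (<$fh4>) {
    chomp;
    my @F   = split;
    my @tmp = split /:/, $F[1];    # gene, NM ID, exon, cDNA change, protein change
    my ($exon) = /:exon(\d+):/;
    my ($nm)   = /(NM_\d+)/;
    # Accept both c.G478A and c.478G>A; the trailing base is the mutant one.
    my ($num_nucl, $mutation_nucl) = $tmp[3] =~ /(\d+)(?:[ACGT]>)?([ACGT])$/;
    my ($num_prot) = $tmp[4] =~ /(\d+)/;
    my $enst     = defined $nm   ? $hash_nm_enst{$nm} : undef;
    my $sequence = defined $enst ? $hash_nucl{$enst}  : undef;
    unless (defined $exon && defined $num_nucl && defined $num_prot && defined $sequence) {
        warn "skipping line $.: cannot parse it or no CDS found\n";
        next;
    }
    print "$i. $tmp[0] $F[0] the $exon -th exon(s) of $enst \n";
    $i++;
    my $num_up = 3 * $num_prot - 39;    # 0-based start, 12 codons before the mutated codon
    $num_up = 0 if $num_up < 0;         # mutation within the first 12 residues
    my $out_nucl = substr($sequence, $num_up, 75);
    print "WT:$out_nucl\n ", translate($out_nucl), "\n";
    my $mutation_pos = $num_nucl - $num_up - 1;
    substr($out_nucl, $mutation_pos, 1, $mutation_nucl) if $mutation_pos < length($out_nucl);
    print "MU:$out_nucl\n ", translate($out_nucl), "\n";
}
close $fh4;
[/perl]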
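
The window arithmetic in the script (start at `3 * protein_position - 39`, take 75 bases, substitute at `cDNA_position - start - 1`) is easy to get wrong by one, so here is a small Python restatement that can be checked on a toy sequence. It is a sketch, not the original script: the function names are hypothetical and the toy CDS is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nucl):
    """Translate complete codons; unknown codons become '?'."""
    return "".join(CODON.get(nucl[i:i + 3], "?") for i in range(0, len(nucl) - 2, 3))


def mutation_window(cds, nt_pos, prot_pos, alt, flank=12):
    """Return (wild-type, mutant) windows of `flank` codons either side of the hit codon.

    nt_pos and prot_pos are 1-based, as in c.G478A / p.D160N.
    """
    start = max(3 * (prot_pos - 1) - 3 * flank, 0)   # equals 3*prot_pos - 39 for flank=12
    wt = cds[start:start + 3 * (2 * flank + 1)]      # 75 bases for flank=12
    offset = nt_pos - 1 - start                      # 0-based position inside the window
    mu = wt[:offset] + alt + wt[offset + 1:]
    return wt, mu


# Toy CDS (made up): ATG followed by 30 alanine codons; mutate base 43 (codon 15) G -> A.
toy_cds = "ATG" + "GCT" * 30
wt, mu = mutation_window(toy_cds, nt_pos=43, prot_pos=15, alt="A")
print("WT:", translate(wt))
print("MU:", translate(mu))
```

Clamping the start at 0 keeps mutations in the first 12 residues from wrapping around to the end of the sequence, which is what a negative offset would do.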

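Mar 23

Annotating SNPs with annovar

Posted on March 23, 2015 by ulwvfje

1. Downloading and installing the software

Downloading this software requires registering with an edu email address; presumably it is only open to research institutions and universities. So I will not list the download address.

It is really just a few Perl programs; the more important part is the human database, which is required for SNP annotation.

Reference: http://annovar.readthedocs.org/en/latest/misc/accessory/

2. Preparing the data

Annotation of course needs a database, and the database does have a download address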

http://www.openbioinformatics.org/annovar/download/hg19_ALL.sites.2010_11.txt.gz

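It can also be downloaded with a command

perl ./annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/

We are annotating the VCF file produced by the SNP-calling pipeline, so you need your own VCF file. For a detailed explanation of the VCF format, see another post on this blog, or just search for it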

http://vcftools.sourceforge.net/man_latest.html

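3. The commands to run

First, convert the VCF file into a space-delimited file; this is also easy to do with a script of your own

perl convert2annovar.pl -format vcf4 /home/jmzeng/raw-reads/whole-exon/snp-calling/tmp1.vcf >annovar.input

The result is a space-delimited file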
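
For readers who do want to write that conversion themselves, here is a minimal sketch covering single-base substitutions only. It assumes the five leading annovar input columns are chromosome, start, end, reference allele and alternate allele, with start equal to end for an SNV. Indels, which `convert2annovar.pl` re-encodes, are skipped here; the function name and the example line are hypothetical.

```python
def vcf_snv_to_avinput(line):
    """Return 'chr start end ref alt' rows for the single-base substitutions on one VCF line."""
    if not line.strip() or line.startswith("#"):
        return []  # blank line or VCF header
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
    rows = []
    for alt in alts.split(","):
        # Keep only single-base REF/ALT pairs; skip indels and symbolic alleles.
        if len(ref) == 1 and len(alt) == 1 and alt.upper() in "ACGT":
            rows.append(" ".join([chrom, pos, pos, ref, alt]))
    return rows


# Illustrative VCF data line (not taken from a real file).
example = "chr1\t874467\t.\tG\tA\t50\tPASS\t.\n"
for row in vcf_snv_to_avinput(example):
    print(row)
```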

Then simply annotate the converted data

./annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/

4. Interpreting the output files

生信菜鸟团

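You are welcome to leave comments and join the discussion on the forum biotrainee.com, or follow the WeChat official account of the same name, biotrainee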

Tag Archives: annovar
