五 28

org.Xx.eg.db系列包概述

Posted on 2015年5月28日 by ulwvfje

在bioconductor的官网里面可以查到共有111个系列包，基本上跨越了我们常见的物种啦！

斑马鱼：Bioconductor - org.Dr.eg.db - /packages/release/data/annotation/html/org.Dr.eg.db.html

Details biocViews AnnotationData , Danio_rerio , OrgDb Version 3

拟南芥：Bioconductor - org.At.tair.db - /packages/release/data/annotation/html/org.At.tair.db.html

Details biocViews AnnotationData , Arabidopsis_thaliana , OrgDb Version 3

小鼠：Bioconductor - org.Mm.eg.db - /packages/release/data/annotation/html/org.Mm.eg.db.html

Details biocViews AnnotationData , Mus_musculus , OrgDb , mouseLLMappings Version 3

人类：Bioconductor - org.Hs.eg.db - /packages/release/data/annotation/html/org.Hs.eg.db.html

Details biocViews AnnotationData , Homo_sapiens , OrgDb , humanLLMappings Version 3

对这些系列包的函数都一样，包括以下几个：

columns(x) keytypes(x) keys(x, keytype, ...) select(x, keys, columns, keytype, ...) saveDb(x, file) loadDb(file, dbType, dbPackage, ...)

这些包就是bioconductor已经做好的数据库，我们可以根据定义好的ID号来进行任意的基因转换，现在支持的信息有一下几种！

keytypes(org.Hs.eg.db)

[1] "ENTREZID" "PFAM" "IPI" "PROSITE" "ACCNUM" "ALIAS" "CHR"

[8] "CHRLOC" "CHRLOCEND" "ENZYME" "MAP" "PATH" "PMID" "REFSEQ"

[15] "SYMBOL" "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME" "UNIPROT"

[22] "GO" "EVIDENCE" "ONTOLOGY" "GOALL" "EVIDENCEALL" "ONTOLOGYALL" "OMIM"

[29] "UCSCKG"

这些包的确非常有用，大家可以看我博客里面关于它们的介绍！！！

五 28

菜鸟团第二次作业的部分答案

Posted on 2015年5月28日 by ulwvfje

> library(org.Hs.eg.db)

载入需要的程辑包：AnnotationDbi载入需要的程辑包：stats4载入需要的程辑包：GenomeInfoDb载入需要的程辑包：S4Vectors载入需要的程辑包：IRanges载入程辑包：‘AnnotationDbi’The following object is masked from ‘package:GenomeInfoDb’: species载入需要的程辑包：DBI

1、人共有多少个entrez id的基因呢？

x <- org.Hs.egENSEMBLTRANS

# Get the entrez gene IDs that are mapped to an Ensembl ID

mapped_genes <- mappedkeys(x)

# Convert to a list

xx <- as.list(x[mapped_genes])

length(x)

[1] 47721

可知共有47721个基因都是有entrez ID号的

2、能对应转录本ID的基因有多少个呢？

length(xx)

[1] 20592

可以看到共有20592个基因都是有转录本的！

2、能对应ensembl的gene ID的基因有多少个呢？

x <- org.Hs.egENSEMBL

# Get the entrez gene IDs that are mapped to an Ensembl ID

mapped_genes <- mappedkeys(x)

# Convert to a list

xx <- as.list(x[mapped_genes])

> length(x)

[1] 47721

> length(xx)

[1] 26019

可以看到只有26019是有ensembl的gene ID的

3、那么基因对应的转录本分布情况如何呢？

table(unlist(lapply(xx,length)))

可以看出绝大部分的基因都是20个转录本一下的，但也有极个别基因居然有高达两百个转录本，很可怕！

4、那么基因在染色体的分布情况如何呢？

x <- org.Hs.egCHR

# Get the entrez gene identifiers that are mapped to a chromosome

mapped_genes <- mappedkeys(x)

# Convert to a list

xx <- as.list(x[mapped_genes])

> length(x)

[1] 47721

> length(xx)

[1] 47232

可以看到有接近五百个基因居然是没有染色体定位信息的！！！

table(unlist(xx))

用barplot函数可视化一下，如图

6、那么有多多少基因是有GO注释的呢？

x <- org.Hs.egGO

# Get the entrez gene identifiers that are mapped to a GO ID

mapped_genes <- mappedkeys(x)

# Convert to a list

xx <- as.list(x[mapped_genes])

length(xx)

[1] 18229

> length(x)

[1] 47721

可以看到只有18229个基因是有go注释信息的。

那么基因被注释的go的分布如何呢？

可以看到大部分的基因都是只有30个go的，但是某些基因特别活跃，高达197个go注释。

还有kegg和omin数据库的我就不写了！

五 28

实战R语言bioconductor的seqinr包探究人的所有转录本的性质

Posted on 2015年5月28日 by ulwvfje

首先安装这个包

source("http://bioconductor.org/biocLite.R")

biocLite("seqinr")

然后加载包，并读取我们的CDS.fa文件

library("seqinr")

human_cds=read.fasta("CDS.fa")

#这一个步骤非常耗时间，可能是因为我们的转录本文件有十万多个转录本的原因吧

str(human_cds) #查看可知读入了一个list，其中每个转录本都是list的一个元素

List of 100778

$ ENST00000415118:Class 'SeqFastadna' atomic [1:8] g a a a ...

.. ..- attr(*, "name")= chr "ENST00000415118"

.. ..- attr(*, "Annot")= chr ">ENST00000415118 havana_ig_gene:known chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997 gene_biotype:TR_D_gene tran"| __truncated__

$ ENST00000448914:Class 'SeqFastadna' atomic [1:13] a c t g ...

.. ..- attr(*, "name")= chr "ENST00000448914"

.. ..- attr(*, "Annot")= chr ">ENST00000448914 havana_ig_gene:known chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985 gene_biotype:TR_D_gene tran"| __truncated__

对list的每个元素都有几种函数可以处理得到信息：

Length,table,GC,count

其中count函数很有趣，数一数序列里面的这些组合出现的次数

count(dengueseq, 1)

count(dengueseq, 2)接下来我们随机取human_cds这个list的一个元素用这几个函数对它处理一下

> tmp=human_cds[[1]]

> tmp

[1] "g" "a" "a" "a" "t" "a" "g" "t"

attr(,"name")

[1] "ENST00000415118"

attr(,"Annot")

[1] ">ENST00000415118 havana_ig_gene:known chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene"

attr(,"class")

[1] "SeqFastadna"

再看看函数的结果

> length(tmp)

[1] 8

> table(tmp)

tmp

a g t

4 2 2

> GC(tmp)

[1] 0.25

> count(tmp,1)

a c g t

4 0 2 2

> count(tmp,2)

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

2 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0

还是挺好用的，接下来我们应用R的知识来对着十万多个转录本进行一些简单的总结

human_cds_length=unlist(lapply(human_cds,length))

human_cds_gc=unlist(lapply(human_cds,GC))

这样就得到了所有转录本的长度和GC含量信息

然后我们简单统计一下，并画几个图表吧！

> summary(human_cds_length)

Min. 1st Qu. Median Mean 3rd Qu. Max.

3 366 699 1132 1425 108000

> summary(human_cds_gc)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.1467 0.4577 0.5285 0.5264 0.5932 0.8917

可以看到还是有很多很极端的转录本的存在！

最长的转录本也不过10k，而我记得最长的基因高达8M，看了内含子远大于外显子呀。

但是GC含量有很多高于80%，这些基因在二代测序的研究中是一个盲区。

这些极端基因可以通过biomaRt等包进行注释，得到gene名和功能信息。

hist(human_cds_gc)

hist(log10(human_cds_length))

GC含量分布如图

长度分布如图

附表：

http://www.bioinformatics.org/sms/iupac.html 所有字符的碱基氨基酸意义表格

生信菜鸟团

欢迎去论坛biotrainee.com留言参与讨论，或者关注同名微信公众号biotrainee

Daily Archives: 2015年5月28日

org.Xx.eg.db系列包概述

菜鸟团第二次作业的部分答案

> library(org.Hs.eg.db)

实战R语言bioconductor的seqinr包探究人的所有转录本的性质

首先安装这个包

2015年5月
一	二	三	四	五	六	日
« 四				六 »
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31