gskb: a gene set database for mouse

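Gene Set Knowledgebase (GSKB) is modeled closely on MSigDB (the Molecular Signatures Database), the gene set collection that the Broad Institute maintains for the GSEA algorithm. GSKB itself is not a Broad product: it was built for mouse by Xijin Ge's lab and is distributed as the Bioconductor package gskb. Its gene sets fall into 7 categories: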

Gene Ontology
Curated pathways
Metabolic Pathways
Transcription Factor (TF)
microRNA target genes
Location (cytogenetic band)
Others

The authors collected and curated data from more than 40 different knowledge databases, yielding 33,261 gene sets.

Installing the gskb R package

Install the package and open its PDF vignette:

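## biocLite() has been retired; on current R versions install through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")  # optional: USTC mirror
BiocManager::install("gskb")
library(gskb)
browseVignettes("gskb")
## PGSEA is needed for the analysis section; it may require an older Bioconductor release
BiocManager::install("PGSEA")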

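Exploring the built-in datasets

The gene sets are split into 7 datasets, each of which can be loaded and inspected separately: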

library(gskb) 
data(mm_miRNA)
mm_miRNA[[1]][1:10]

mm_GO   Gene Ontology Data for Mouse
mm_location Chromosomal Location Data for Mouse
mm_metabolic    Metabolic Pathways Data for Mouse
mm_miRNA    miRNA Target Genes Data for Mouse
mm_other    Other Data for Mouse
mm_pathway  Pathway Data for Mouse
mm_TF   Transcription Factor Target Genes Data for Mouse

Each gene set in the package is stored as a character vector in the following format:

 [1] "MIRNA_MM_BETEL_MMU-LET-7A"                                                                        
 [2] "BETEL_MMU-LET-7A; Good mirSVR score Conserved; The microRNA.org resource: targets and expression."
 [3] "NSUN4"                                                                                            
 [4] "DCX"                                                                                              
 [5] "KCNK6"                                                                                            
 [6] "PBX1"                                                                                             
 [7] "PHF8"                                                                                             
 [8] "RACGAP1"                                                                                          
 [9] "EFHD2"                                                                                            
[10] "DCBLD2"

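Note that the first two elements are not genes: the first is the gene set name and the second is its description. Keep this in mind when working with the raw vectors.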
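When only the gene symbols are needed, the two metadata elements can be split off. The following is a minimal sketch on a toy list that imitates the layout shown above; `toy_sets` and the helper `as_gene_list()` are hypothetical names introduced for illustration and are not part of gskb.

```r
# Toy stand-in for a gskb dataset: each entry is c(set name, description, genes...)
toy_sets <- list(
  c("MIRNA_MM_BETEL_MMU-LET-7A",
    "BETEL_MMU-LET-7A; Good mirSVR score Conserved; The microRNA.org resource: targets and expression.",
    "NSUN4", "DCX", "KCNK6"),
  c("TOY_SET_B", "hypothetical description", "PBX1", "PHF8")
)

# Hypothetical helper: drop the first two elements and name each entry by its set name
as_gene_list <- function(sets) {
  genes <- lapply(sets, function(x) x[-(1:2)])
  names(genes) <- vapply(sets, function(x) x[1], character(1))
  genes
}

gene_list <- as_gene_list(toy_sets)
lengths(gene_list)  # number of genes per set
```

Assuming a real dataset such as mm_miRNA has the same layout, the same helper should apply to it after `data(mm_miRNA)`.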

Differential analysis of gene sets

library(PGSEA)
library(gskb)
data(mm_miRNA)
gse <- read.csv("http://ge-lab.org/gskb/GSE40261.csv", header=TRUE, row.names=1)
# Center each gene on its mean expression across samples
gse <- gse - apply(gse,1,mean)  

pg <- PGSEA(gse, cl=mm_miRNA, range=c(15,2000), p.value=NA)
# Remove pathways whose scores are NA in every sample. This can happen when
# too few of a pathway's genes match the expression matrix.
pg2 <- pg[rowSums(is.na(pg))!= dim(gse)[2], ]
# mm_miRNA ships with 1868 gene sets; 1668 remain after this filter.
# Compute the difference in average Z score between the two sample groups
# and rank the pathways by its absolute value.
diff <- abs( apply(pg2[,1:4],1,mean) - apply(pg2[,5:8], 1, mean) )
pg2 <- pg2[order(-diff), ]  

sub <- factor( c( rep("Control",4),rep("Anti-miR-29",4) ) ) 
smcPlot(pg2[1:15,],sub,scale=c(-12,12),show.grid=TRUE,margins=c(1,1,7,19),col=.rwb)
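To make the filtering and ranking steps easier to follow, here is a small sketch on a toy score matrix. The set names, sample names, and values are invented for illustration; only the two operations (dropping all-NA rows, then ordering by the absolute difference of group means) mirror the code above.

```r
# Toy score matrix: 3 hypothetical gene sets x 4 samples (2 per group)
scores <- rbind(
  set_a = c(1, 1, 3, 3),
  set_b = c(NA, NA, NA, NA),  # no usable score in any sample
  set_c = c(0, 2, -4, -6)
)
colnames(scores) <- c("ctrl1", "ctrl2", "trt1", "trt2")

# Keep only rows that have a score in at least one sample
kept <- scores[rowSums(is.na(scores)) != ncol(scores), , drop = FALSE]

# Absolute difference between the two group means, largest first
delta <- abs(rowMeans(kept[, 1:2, drop = FALSE]) - rowMeans(kept[, 3:4, drop = FALSE]))
ranked <- kept[order(-delta), , drop = FALSE]
rownames(ranked)
```

Here set_b is dropped, and set_c (group means 1 and -5) ranks ahead of set_a (group means 1 and 3).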

The dataset comes from GEO series GSE40261 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40261), published in 2012: "Hepatic gene expression changes following antisense oligonucleotide-based inhibition of miR-29a".

The samples in this expression matrix are:

> colnames(gse)
[1] "GSM989360_Control1"         "GSM989361__Control2"        "GSM989362_Control3"        
[4] "GSM989363__Control4"        "GSM989364_Anti.miR.29_rep1" "GSM989365_Anti.miR.29_rep1"
[7] "GSM989366_Anti.miR.29_rep3" "GSM989367_Anti.miR.29_rep4"

With a score for every gene set in every sample, plus the sample annotations, you are free to run whatever downstream analysis you like.
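As one possible downstream step (an illustration, not something the original analysis does), each gene set can be tested for a difference between the two groups. The sketch below runs a per-row Welch t-test with Benjamini-Hochberg adjustment on a simulated score matrix; the matrix, the planted shift in set_1, and all object names are assumptions made for the example.

```r
set.seed(1)

# Simulated scores: 5 hypothetical gene sets x 8 samples
scores <- matrix(rnorm(5 * 8), nrow = 5,
                 dimnames = list(paste0("set_", 1:5), paste0("s", 1:8)))
scores["set_1", 5:8] <- scores["set_1", 5:8] + 5  # plant a shift in one set

# Same group layout as the example data: 4 controls, then 4 treated samples
group <- factor(c(rep("Control", 4), rep("Anti-miR-29", 4)))

# Welch t-test per gene set, then Benjamini-Hochberg correction
p_raw <- apply(scores, 1, function(z) t.test(z ~ group)$p.value)
p_adj <- p.adjust(p_raw, method = "BH")

result <- data.frame(p_raw = p_raw, p_adj = p_adj, row.names = rownames(scores))
result <- result[order(result$p_raw), ]
result
```

With only four samples per group such tests have little power, so the output is best read as a ranking aid rather than as definitive evidence.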


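生信菜鸟团 (Bioinformatics Rookies)

You are welcome to join the discussion on the biotrainee.com forum, or follow the WeChat public account of the same name, biotrainee.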
