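Gene Set Knowledgebase (GSKB) is a collection of mouse gene sets modeled directly on MSigDB (Molecular Signatures Database), the gene set database that accompanies the GSEA algorithm from the well-known Broad Institute. Like MSigDB, it is divided into 7 categories: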
- Gene Ontology
- Curated pathways
- Metabolic Pathways
- Transcription Factor (TF)
- microRNA target genes
- Location (cytogenetic band)
- Others
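The gene sets were collected and curated from more than 40 different knowledge databases, giving 33,261 gene sets in total.
Installing the gskb R package
Install the package and open its PDF tutorial: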
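## biocLite() has been superseded by BiocManager (Bioconductor 3.8 and later)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
## optional: use the USTC mirror
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
BiocManager::install("gskb")
library(gskb)
## open the package vignette (the PDF tutorial)
browseVignettes("gskb")
## PGSEA is needed for the analysis example
BiocManager::install("PGSEA")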
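Latest version of the tutorial: https://bioconductor.org/packages/release/data/experiment/html/gskb.html
Inspecting the built-in datasets
The package ships 7 datasets, each of which can be loaded and inspected separately: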
library(gskb)
data(mm_miRNA)
mm_miRNA[[1]][1:10]
- mm_GO: Gene Ontology Data for Mouse
- mm_location: Chromosomal Location Data for Mouse
- mm_metabolic: Metabolic Pathways Data for Mouse
- mm_miRNA: miRNA Target Genes Data for Mouse
- mm_other: Other Data for Mouse
- mm_pathway: Pathway Data for Mouse
- mm_TF: Transcription Factor Target Genes Data for Mouse
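To get a quick overview of all seven datasets at once, something like the following sketch may help. It assumes gskb is installed and that each dataset is a list with one element per gene set (as the `mm_miRNA[[1]]` indexing above suggests); `count_gene_sets` is a hypothetical helper, not part of the package.

```r
# The seven dataset names listed above.
collections <- c("mm_GO", "mm_location", "mm_metabolic", "mm_miRNA",
                 "mm_other", "mm_pathway", "mm_TF")

# Hypothetical helper: load each dataset into a private environment
# and report how many gene sets it contains.
count_gene_sets <- function(dataset_names, pkg = "gskb") {
  env <- new.env()
  data(list = dataset_names, package = pkg, envir = env)
  vapply(dataset_names,
         function(nm) length(get(nm, envir = env)),
         integer(1))
}

# Only run when gskb is actually available.
if (requireNamespace("gskb", quietly = TRUE)) {
  print(count_gene_sets(collections))
}
```

Loading into a separate environment keeps the seven fairly large objects out of the global workspace.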
Each gene set in the package is stored in the following format:
 [1] "MIRNA_MM_BETEL_MMU-LET-7A"
 [2] "BETEL_MMU-LET-7A; Good mirSVR score Conserved; The microRNA.org resource: targets and expression."
 [3] "NSUN4"
 [4] "DCX"
 [5] "KCNK6"
 [6] "PBX1"
 [7] "PHF8"
 [8] "RACGAP1"
 [9] "EFHD2"
[10] "DCBLD2"
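Note that the first two elements are not genes: the first is the gene set name and the second is its description. Keep this in mind when working with the gene lists directly.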
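If you need plain gene vectors (for example, to feed another tool), the two leading elements have to be stripped first. The sketch below uses a small toy list that mimics the layout shown above so it runs without gskb; `split_gene_sets` and the second toy set are made up for illustration.

```r
# Toy list mimicking the gskb layout: name, description, then gene symbols.
toy_sets <- list(
  c("MIRNA_MM_BETEL_MMU-LET-7A",
    "BETEL_MMU-LET-7A; Good mirSVR score Conserved; The microRNA.org resource: targets and expression.",
    "NSUN4", "DCX", "KCNK6", "PBX1"),
  c("HYPOTHETICAL_SET_B",                      # made-up set for illustration
    "Made-up description",
    "PHF8", "RACGAP1")
)

# Hypothetical helper: drop the first two elements of every set and
# use the first element (the set name) as the list name.
split_gene_sets <- function(sets) {
  genes <- lapply(sets, function(x) x[-(1:2)])
  names(genes) <- vapply(sets, function(x) x[1], character(1))
  genes
}

gene_sets <- split_gene_sets(toy_sets)
lengths(gene_sets)   # number of genes per set
```

With the real data, the same helper could presumably be applied as `split_gene_sets(mm_miRNA)`.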
library(PGSEA)
library(gskb)
data(mm_miRNA)
# Expression matrix: genes in rows, samples in columns
gse <- read.csv("http://ge-lab.org/gskb/GSE40261.csv", header=TRUE, row.names=1)
# Center each gene on its mean expression across samples
gse <- gse - rowMeans(gse)
# Score every gene set in every sample; only sets with 15 to 2000 genes are used
pg <- PGSEA(gse, cl=mm_miRNA, range=c(15,2000), p.value=NA)
# Remove gene sets whose scores are NA in every sample, which can happen
# when a set has too few matching genes.
pg2 <- pg[rowSums(is.na(pg)) != ncol(gse), ]
# The built-in collection has 1868 gene sets; 1668 remain after filtering.
# Compute the difference in average Z score between the two groups of samples
# and rank the gene sets by its absolute value.
score_diff <- abs( rowMeans(pg2[,1:4]) - rowMeans(pg2[,5:8]) )
pg2 <- pg2[order(-score_diff), ]
sub <- factor( c( rep("Control",4), rep("Anti-miR-29",4) ) )
# Plot the 15 top-ranked gene sets
smcPlot(pg2[1:15,], sub, scale=c(-12,12), show.grid=TRUE, margins=c(1,1,7,19), col=.rwb)
The dataset comes from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40261, published in 2012, a study of hepatic gene expression changes following antisense oligonucleotide-based inhibition of miR-29a.
The samples in this expression matrix are:
> colnames(gse)
[1] "GSM989360_Control1"         "GSM989361__Control2"        "GSM989362_Control3"
[4] "GSM989363__Control4"        "GSM989364_Anti.miR.29_rep1" "GSM989365_Anti.miR.29_rep1"
[7] "GSM989366_Anti.miR.29_rep3" "GSM989367_Anti.miR.29_rep4"
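With a score for every gene set in every sample, plus the sample annotations, you are free to carry out whatever downstream analysis you like.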
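As one possible downstream step, the sketch below derives the group labels from the sample names and runs a per-gene-set t-test between the two groups. The score matrix here is simulated so the example runs on its own; in practice you would substitute the filtered PGSEA score matrix (and handle any remaining NA values). The choice of Welch t-test plus Benjamini-Hochberg correction is an assumption for illustration, not something the gskb tutorial prescribes.

```r
set.seed(1)

# Sample names as printed by colnames(gse) above.
sample_names <- c("GSM989360_Control1", "GSM989361__Control2",
                  "GSM989362_Control3", "GSM989363__Control4",
                  "GSM989364_Anti.miR.29_rep1", "GSM989365_Anti.miR.29_rep1",
                  "GSM989366_Anti.miR.29_rep3", "GSM989367_Anti.miR.29_rep4")

# Derive the group from the sample name instead of hard-coding positions.
group <- factor(ifelse(grepl("Control", sample_names), "Control", "Anti-miR-29"),
                levels = c("Control", "Anti-miR-29"))

# Simulated stand-in for the score matrix (gene sets x samples).
scores <- matrix(rnorm(5 * 8), nrow = 5,
                 dimnames = list(paste0("SET_", 1:5), sample_names))
is_treated <- group == "Anti-miR-29"
scores["SET_1", is_treated] <- scores["SET_1", is_treated] + 5  # planted effect

# Welch t-test per gene set, then Benjamini-Hochberg adjustment.
p_values <- apply(scores, 1, function(z) {
  t.test(z[is_treated], z[!is_treated])$p.value
})
results <- data.frame(
  mean_diff = rowMeans(scores[, is_treated]) - rowMeans(scores[, !is_treated]),
  p_value   = p_values,
  fdr       = p.adjust(p_values, method = "BH")
)
results[order(results$p_value), ]
```

With only four samples per group the test has little power, so the ranking is best read as exploratory.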