一个表达芯片数据处理实例

这个实例上部分包括:
如何用R包下载GEO数据(只限单一平台,其余平台需要修改下面的代码)
如何对GEO的芯片数据归一化并且得到表达量矩阵,
如何用limma包做差异分析,
对找到的差异基因如何做GO和KEGG注释

首先下载两个GEO数据:
平台是:Affymetrix U133 gene chips
67 diseased triple negative breast cancer samples(GSE31519 )and 42 control samples (GSE20437)
都是表达量数据,同一种芯片。分成两组,正好做差异表达。
数据来源的文献是:
文章title:A clinically relevant gene signature in triple negative and basal-like breast cancer
结论(We describe a ratio of high B-cell presence and low IL-8 activity as a powerful new prognostic marker for TNBC. )
Platform: GPL96 67 Samples
Download data: GEO (CEL, TXT)
SeriesAccession: GSE31519ID: 200031519
文章title:Histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patient

Platform: GPL96 67 Samples
Download data: GEO (CEL, TXT)
SeriesAccession: GSE31519  ID: 200031519

Platform: GPL96 Series: GSE20437   42 Samples

Download data: GEO (CEL)
DataSetAccession: GDS3716  ID: 3716
我首先用R的GEOquery包来下载。(其实你完全可以直接去GEO网站下载数据,然后解压的)
suppressMessages(library(GEOquery))
setwd("D:\\test_analysis\\TNBC")
gse31519=getGEO("GSE31519",GSEMatrix = T,destdir = "./")
getGEOSuppFiles("GSE31519",baseDir = "./")
gse31519=getGEO("GSE20437",GSEMatrix = T,destdir = "./")
getGEOSuppFiles("GSE20437",baseDir = "./")
这样下载之后的数据都存在D:\\test_analysis\\TNBC下面
接下来我们就用affy包和limma来进行差异分析:
library(affy)
library(limma)
affy.data=ReadAffy(celfile.path="./cel_files")
请先搞清楚,ReadAffy 这个函数的用法!

当前工作目录下面有没有cel_files文件夹?

cel_files文件夹下面有没有文件?
eset.rma=rma(affy.data)
exprSet=exprs(eset.rma)
write.table(exprSet,"expr_rma_matrix.txt",quote=F,sep="\t")
group=factor(c(rep("control",42),rep("case",67)))
design = model.matrix(~0+group)
colnames(design)=c("case","control")
rownames(design)=sampleNames(affy.data)
fit=lmFit(exprSet,design)
cont.matrix = makeContrasts(contrasts="case-control",levels=design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)
diff_dat=topTable(fit2,coef=1,n=Inf)
write.table(diff_dat,"diff_dat.txt",quote=F)
 
这样得到的diff_dat就是我们差异分析的结果啦
we choose the log fold cut off change to be “2” to get a manageable set of genes.
原文说:we were able to get a list of 2567 genes after removing the duplicates and the not available genes
我们仅仅根据一个标准来挑选差异基因, the log fold cut off change to be “2”,我只挑出来了782个探针

接下来对这些探针进行注释,得到基因名,我这里用biomart包来进行注释
我们的平台是:Affymetrix U133 gene chips,虽然有22283个探针,但是只有13908个基因
所以代码如下:
ensembl=useMart("ensembl",dataset="hsapiens_gene_ensembl")
gene_probe=getBM(attributes=c("hgnc_symbol","affy_hg_u133a"),filter="affy_hg_u133a",value=rownames(diff_dat),mart=ensembl)
diff_probe=rownames(diff_dat[abs(diff_dat[,1])>2,])
diff_gene=gene_probe[match(diff_probe,gene_probe[,2]),1]
diff_gene=na.omit(diff_gene)
diff_gene=unique(diff_gene)
length(diff_gene)
这样会得到604个差异基因
然后我做一下GO和KEGG的富集分析
gene_entrez=getBM(attributes=c("hgnc_symbol","entrezgene"),filter="hgnc_symbol",value=diff_gene,mart=ensembl)
require(DOSE)
require(clusterProfiler)
gene_entrez=na.omit(gene_entrez)
gene=as.character(gene_entrez[,2])
ego <- enrichGO(gene=gene,organism="human",ont="CC",pvalueCutoff=0.01,readable=TRUE)
ekk <- enrichKEGG(gene=gene,organism="human",pvalueCutoff=0.01,readable=TRUE)
write.csv(summary(ekk),"KEGG-enrich.csv",row.names =F)
write.csv(summary(ego),GO-enrich.csv,row.names =F)

懒得上传图片了,大家可以用同样的代码自己实现所有的流程

Comments are closed.