Getting to know five breast cancer expression datasets
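I recently needed to learn the genefu package and apply it to my own data. Its vignette mentions five breast cancer expression datasets, which can be installed as follows: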
# biocLite() has been retired; BiocManager is the current Bioconductor installer
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
options(BioC_mirror = "http://mirrors.ustc.edu.cn/bioc/")  # optional mirror
BiocManager::install("genefu")
BiocManager::install(c("breastCancerMAINZ", "breastCancerTRANSBIG", "breastCancerUPP",
                       "breastCancerUNT", "breastCancerNKI"),
                     ask = FALSE, update = FALSE)
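All five datasets were published by earlier studies. The Mainz, Transbig, UPP, and UNT datasets correspond to GSE11121, GSE7390, GSE3494, and GSE2990, respectively. The NKI dataset was never deposited in GEO; it was compiled from the authors' supplementary materials.
Together they cover 1123 patients, and the clinical information is fairly complete.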
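As a quick sanity check, the total can be reproduced by adding up the per-dataset sample counts that R reports when the objects are printed later in this post. This is only a small sketch; the vector name `n.samples` is made up for illustration.

```r
# Sample counts as reported by the printed ExpressionSet objects
n.samples <- c(transbig = 198, unt = 137, upp = 251, mainz = 200, nki = 337)

# Total number of samples across the five datasets
total <- sum(n.samples)
print(total)  # 1123
```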
GSE11121
The paper describing this dataset is: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008 Jul 1;68(13):5405-13. PMID: 18593943
It uses the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. From the abstract: "we analyzed the gene expression patterns of 200 tumors of patients who were not treated by systemic therapy after surgery using a discovery approach."
The analysis of these patients identified groups of co-regulated genes related to:
- the biological process of proliferation
- steroid hormone receptor expression
- B cell and T cell infiltration.
GSE7390
The paper describing this dataset is: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007 Jun 1;13(11):3207-14. PMID: 17545524
It uses the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. From the abstract: "Gene expression profiling of frozen samples from 198 N- systemically untreated patients was performed at the Bordet Institute, blinded to clinical data and independent of Veridex."
GSE3494
The paper describing this dataset is: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5. PMID: 16141321
It uses the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. The samples are described as "freshly frozen breast tumors from a population-based cohort of 315 women representing 65% of all breast cancers resected in Uppsala County, Sweden, from January 1, 1987 to December 31, 1989."
The patient information collected is fairly complete:
INDEX (ID)
p53 seq mut status (p53+=mutant; p53-=wt)
p53 DLDA classifier result (0=wt-like, 1=mt-like)
DLDA error (1=yes, 0=no)
Elston histologic grade
ER status
PgR status
age at diagnosis
tumor size (mm)
Lymph node status
DSS TIME (Disease-Specific Survival Time in years)
DSS EVENT (Disease-Specific Survival EVENT; 1=death from breast cancer, 0=alive or censored )
GSE2990
The paper describing this dataset is: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006 Feb 15;98(4):262-72. PMID: 16478745
It uses the GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. From the abstract: "We analyzed microarray data from 189 invasive breast carcinomas and from three published gene expression datasets from breast carcinomas."
This series reuses data from GSE3494, as its GEO description notes: "The patients coming from Uppsala Hospital have been also used in other studies as in GSE3494. You can find the common set of patients in removing the abbreviation “UPP_” from the sample names and compare the results with the “INDEX (ID)” from the GSE3494 series."
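The matching rule quoted from the GEO description can be sketched in a few lines of base R. The sample names and INDEX values below are hypothetical stand-ins chosen for illustration, not the real series metadata, and the variable names are my own.

```r
# Hypothetical GSE2990 sample names (illustration only)
gse2990.names <- c("UPP_103B41", "UPP_104B91", "OXFU_104", "KIU_89A64")
# Hypothetical "INDEX (ID)" values from GSE3494 (illustration only)
gse3494.index <- c("103B41", "9B52", "104B91")

# Keep only the Uppsala samples, then strip the "UPP_" prefix
is.upp  <- grepl("^UPP_", gse2990.names)
upp.ids <- sub("^UPP_", "", gse2990.names[is.upp])

# Patients present in both series
common <- intersect(upp.ids, gse3494.index)
print(common)
```

With real data, `gse2990.names` and `gse3494.index` would be taken from the sample annotations of the two series.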
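Loading the data into R
The genefu package already ships a processed, reduced version of these five datasets (seven features per dataset, as the output below shows), so they can be loaded into R and inspected directly.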
library(genefu)  # data(breastCancerData) comes from genefu
library(breastCancerMAINZ)
library(breastCancerTRANSBIG)
library(breastCancerUPP)
library(breastCancerUNT)
library(breastCancerNKI)
data(breastCancerData)
# Combining the ExpressionSet objects with c() gives a named list
data.all <- c("transbig7g"=transbig7g, "unt7g"=unt7g, "upp7g"=upp7g,
"mainz7g"=mainz7g, "nki7g"=nki7g)
Printing the list shows each dataset clearly:
> data.all
$transbig7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 198 samples
element names: exprs
protocolData: none
phenoData
sampleNames: VDXGUYU_4002 VDXGUYU_4008 ... VDXRHU_5240 (198 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 17545524
Annotation: hgu133a
$unt7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 137 samples
element names: exprs
protocolData: none
phenoData
sampleNames: OXFU_104 OXFU_1065 ... KIU_89A64 (137 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 16478745
Annotation: hgu133ab
$upp7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 251 samples
element names: exprs
protocolData: none
phenoData
sampleNames: UPP_103B41 UPP_104B91 ... UPP_9B52 (251 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 16141321
Annotation: hgu133ab
$mainz7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 200 samples
element names: exprs
protocolData: none
phenoData
sampleNames: MAINZ_BC6001 MAINZ_BC6002 ... MAINZ_BC6232 (200 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 205225_at 216836_s_at ... 202763_at (7 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 18593943
Annotation: hgu133a
$nki7g
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7 features, 337 samples
element names: exprs
protocolData: none
phenoData
sampleNames: NKI_4 NKI_6 ... NKI_404 (337 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: NM_000125 NM_004448 ... NM_004346 (7 total)
fvarLabels: probe EntrezGene.ID ... Description (10 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: rosetta
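The last dataset was generated on an Agilent platform, while the others all come from Affymetrix arrays, so together they make good practice material for batch-effect correction algorithms.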
dn <- c("transbig", "unt", "upp", "mainz", "nki")
dn.platform <- c("affy", "affy", "affy", "affy", "agilent")
References: http://genomicsclass.github.io/book/pages/svacombat.html and https://www.biostars.org/p/196430/ give an accessible explanation of what batch correction is.
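To make the idea concrete, here is a minimal sketch of the simplest possible correction: centering each gene within each platform. This is plain per-batch mean-centering, not ComBat or SVA as covered in the references above, and it runs on a simulated toy matrix rather than the real datasets. The function name `center.by.batch` and the artificial shift of 5 are assumptions made for illustration.

```r
set.seed(1)
dn.platform <- c("affy", "affy", "affy", "affy", "agilent")

# Toy data: 3 genes x 10 samples, two samples per dataset (simulated values)
batch <- rep(dn.platform, each = 2)
expr  <- matrix(rnorm(30), nrow = 3,
                dimnames = list(paste0("gene", 1:3), NULL))
# Add an artificial platform shift to the "agilent" samples
expr[, batch == "agilent"] <- expr[, batch == "agilent"] + 5

# Subtract each gene's within-batch mean from the samples of that batch
center.by.batch <- function(x, batch) {
  for (b in unique(batch)) {
    idx <- batch == b
    x[, idx] <- x[, idx, drop = FALSE] - rowMeans(x[, idx, drop = FALSE])
  }
  x
}

expr.centered <- center.by.batch(expr, batch)
# Per-gene means within each platform are now (numerically) zero
print(round(rowMeans(expr.centered[, batch == "agilent"]), 10))
```

Centering like this also removes any genuine biological difference between the batches, which is one reason the model-based methods in the references are usually preferred.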
More importantly, the clinical information for all five datasets has been reorganized into a common set of columns:
cinfo <- colnames(pData(mainz7g))
> cinfo
[1] "samplename" "dataset" "series" "id"
[5] "filename" "size" "age" "er"
[9] "grade" "pgr" "her2" "brca.mutation"
[13] "e.dmfs" "t.dmfs" "node" "t.rfs"
[17] "e.rfs" "treatment" "tissue" "t.os"
[21] "e.os"
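Because the column names are shared, the clinical tables can be stacked into one data frame. The sketch below uses two small made-up data frames in place of `pData(mainz7g)` and `pData(nki7g)`; every value in them is hypothetical, and only the column names are taken from the listing above.

```r
# A subset of the shared clinical columns
cols <- c("samplename", "dataset", "age", "er", "t.dmfs", "e.dmfs")

# Hypothetical stand-ins for pData(mainz7g) and pData(nki7g)
mainz.pd <- data.frame(samplename = c("MAINZ_a", "MAINZ_b"), dataset = "MAINZ",
                       age = c(60, 45), er = c(1, 0),
                       t.dmfs = c(1200, 3000), e.dmfs = c(1, 0),
                       stringsAsFactors = FALSE)
# Same columns in a different order, to show that selecting by name aligns them
nki.pd <- data.frame(er = c(1, 1), age = c(52, 38),
                     samplename = c("NKI_a", "NKI_b"), dataset = "NKI",
                     e.dmfs = c(0, 1), t.dmfs = c(2500, 800),
                     stringsAsFactors = FALSE)

# Select the shared columns by name, then stack the tables row-wise
clin.all <- do.call(rbind, lapply(list(mainz.pd, nki.pd),
                                  function(d) d[, cols]))

# Cross-tabulate ER status by dataset
print(table(clin.all$dataset, clin.all$er))
```

With the real objects, the list passed to `lapply()` would hold the `pData()` of each ExpressionSet, and `cols` could simply be `cinfo`.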
This really is an excellent collection of datasets!