好久没有写TCGA数据库教程了,因为TCGA计划早在2017年就陆陆续续停止了,我那个时候写了几百个教程并且录制了视频。
- 我三年前的TCGA教学视频课程B站地址:https://www.bilibili.com/video/av49363776
- 售后文档记录 https://docs.qq.com/doc/DYkVzUmZLWlhRRXVz 请先通读文档后再发问
- 我这边备份的TCGA数据来源于xena,ucsc的,都在,https://share.weiyun.com/5zLnKmO
安装和加载R包相信已经无需我多说了:
BiocManager::install("curatedTCGAData")
BiocManager::install("TCGAutils")
library(curatedTCGAData)
library(MultiAssayExperiment)
library(TCGAutils)
首先查看TCGA数据库有哪些数据
代码如下:
> curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
Please see the list below for available cohorts and assays
Available Cancer codes:
ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM
Available Data Types:
CNACGH CNACGH_CGH_hg_244a
CNACGH_CGH_hg_415k_g4124a CNASeq CNASNP
CNVSNP GISTIC_AllByGene GISTIC_Peaks
GISTIC_ThresholdedByGene Methylation
Methylation_methyl27 Methylation_methyl450
miRNAArray miRNASeqGene mRNAArray
mRNAArray_huex mRNAArray_TX_g4502a
mRNAArray_TX_g4502a_1
mRNAArray_TX_ht_hg_u133a Mutation
RNASeq2GeneNorm RNASeqGene RPPAArray
多种癌症,多种数据类型。
联网下载数据
可以使用 dry.run 控制是否真的下载,因为如果是下载甲基化信号值矩阵或者表达量矩阵,会耗时很长。
curatedTCGAData(diseaseCode = "COAD", assays = "RPPA*", dry.run = TRUE)
# dry.run logical(1) Whether to return the dataset names before actual download (default TRUE)
(accmae <- curatedTCGAData("ACC", c("CN*", "Mutation"), FALSE))
下载后得到的是一个MultiAssayExperiment对象:
A MultiAssayExperiment object of 3 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 3:
[1] ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 180 columns
[2] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
[3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
可以看到是一个简单正则表达式,匹配。这个MultiAssayExperiment对象非常重要:
The MultiAssayExperiment
class can be used to manage results of diverse assays on a collection of specimen. Currently, the class can handle assays that are organized instances of SummarizedExperiment
, ExpressionSet
, matrix
, RaggedExperiment
(inherits from GRangesList
), and RangedVcfStack
. Create new MultiAssayExperiment
instances with the homonymous constructor, minimally with the argument ExperimentList
, potentially also with the arguments colData
(see section below) and sampleMap
.
获取临床属性
病人多组学数据必须要有临床信息,才能活起来。
head(getSubtypeMap(accmae))
head(getClinicalNames("ACC"))
colData(accmae)[, getClinicalNames("ACC")][1:5, 1:5]
sampleTables(accmae)
可以看到样本可以区分成为 01 10 11 代表 是否是肿瘤样品。
> sampleTables(accmae)
$`ACC_CNASNP-20160128`
01 10 11
90 85 5
$`ACC_CNVSNP-20160128`
01 10 11
90 85 5
$`ACC_Mutation-20160128`
01
90
> sampleTypes[sampleTypes[["Code"]] %in% c("01", "10"), ]
Code Definition Short.Letter.Code
1 01 Primary Solid Tumor TP
10 10 Blood Derived Normal NB
提取肿瘤相关数据
前面的 01 10 11 代表 是否是肿瘤样品,就可以提取。
splitAssays(accmae, c("01", "10"))
tums <- TCGAsampleSelect(colnames(accmae), "01")
accmae[, tums, ]
写出文件
只需要 exportClass 函数即可,把前面的 MultiAssayExperiment对象 里面的数据写出来。
exportClass(accmae, dir = './', fmt = "csv", ext = ".csv")
Writing about 7 files to disk...
[1] ".//accmae_META_0.csv"
[2] ".//accmae_META_1.csv"
[3] ".//accmae_ACC_CNASNP-20160128.csv"
[4] ".//accmae_ACC_CNVSNP-20160128.csv"
[5] ".//accmae_ACC_Mutation-20160128.csv"
[6] ".//accmae_colData.csv"
[7] ".//accmae_sampleMap.csv"
实战
比如提取TCGA数据库的BRCA数据集的TNBC亚型的表达量矩阵。
前面我们提到过,如果是下载甲基化信号值矩阵或者表达量矩阵,会耗时很长。如果是去ucsc的xena浏览器下载,是一个130M左右的压缩包文件。
(a <- curatedTCGAData("BRCA", c("RNASeq2GeneNorm"), FALSE))
head(getSubtypeMap(a))
head(getClinicalNames("BRCA"))
信息如下:
> head(getSubtypeMap(a))
BRCA_annotations BRCA_subtype
1 Patient_ID patientID
2 mrna_subtypes PAM50 mRNA
3 mrna_subtypes SigClust Unsupervised mRNA
4 mrna_subtypes SigClust Intrinsic mRNA
5 microrna_subtypes miRNA Clusters
6 methylation_subtypes methylation Clusters
> head(getClinicalNames("BRCA"))
[1] "years_to_birth" "vital_status" "days_to_death" "days_to_last_followup"
[5] "tumor_tissue_site" "pathologic_stage"
简单的搜索一下 ER.Status PR.Status HER2.Final.Status
phe=as.data.frame(colData(a))
# Gender Age.at.Initial.Pathologic.Diagnosis ER.Status PR.Status HER2.Final.Status
> head( phe[,colnames(phe)[grep('Status', colnames(phe),ignore.case = T)][56:59] ])
ER.Status PR.Status HER2.Final.Status Vital.Status
TCGA-A1-A0SB Positive Negative Negative LIVING
TCGA-A1-A0SD Positive Positive Negative LIVING
TCGA-A1-A0SE Positive Positive Negative LIVING
TCGA-A1-A0SF Positive Positive Negative LIVING
TCGA-A1-A0SG Positive Positive Negative LIVING
TCGA-A1-A0SH Negative Positive Negative LIVING
整体评价:用起来还行,但也不是非他不可!