先了解背景知识:http://www.bio-info-trainee.com/3263.html 并且下载数据
wget http://genemed.uchicago.edu/~pgeeleher/cgpPrediction/paper.zip ## (161M) unzip paper.zip
首先了解数据并且读取到R里面查看
代码是:
setwd("pcaAnalysis/") gdsc_mutDat <- read.csv("../Data/GdscPdata/gdsc_manova_input_w2.csv", as.is=TRUE) gdsc_mutDat[1:6,1:6] rownames(gdsc_mutDat) <- gdsc_mutDat[,1] load(file="../Data/GdscProcData/gdsc_brainarray_syms.RData") dim(gdsc_brainarray_syms) gdsc_brainarray_syms[1:6,1:3] pData <- read.delim("../Data/GdscPdata/E-MTAB-783.sdrf.txt", as.is=TRUE) pData[1:6,1:4]
记录着 715 个细胞系的一些基因的点突变及拷贝数变异信息,以及若干表型信息,还有大量的药物的IC50值,简单查看如下:
> gdsc_mutDat[1:6,1:6] Cell.Line Cosmic_ID Tissue Cancer.Type MS.HL AKT2 1 NA NA 2 MC-CAR 683665 Myeloma blood 0 na::0<cn<8 3 PFSK-1 683667 medulloblastoma CNS 0 na::0<cn<8 4 A673 684052 rhabdomyosarcoma (putative Ewing's) soft tissue 0 na::0<cn<8 5 ES3 684055 Ewings sarcoma bone 0 na::0<cn<8 6 ES5 684057 Ewings sarcoma bone 0 na::0<cn<8
药物IC50值查看如下:
colsIc50 <- which(regexpr("_IC_50$", colnames(gdsc_mutDat)) > 0) allIc50 <- gdsc_mutDat[-1, colsIc50] Ic50FullSet <- data.matrix(allIc50[apply(allIc50, 1, function(r)return(sum(r == ""))) == 0, ]) dim(Ic50FullSet) Ic50FullSet[1:4,1:4]
只有117个细胞系有着全部的 138个药物处理信息:
> Ic50FullSet[1:4,1:4] Erlotinib_IC_50 Rapamycin_IC_50 Sunitinib_IC_50 PHA.665752_IC_50 ES3 5.4575 2.712700 3.6244 5.7420 ES5 6.1779 2.713300 3.2082 2.6719 ES7 5.3345 2.920100 5.1787 4.0530 EW-11 3.9545 0.083814 3.5402 5.1912
上面的不是芯片表达矩阵,而是各个药物在各个细胞系的IC50值矩阵。
还有789个细胞系的表型记录(18项)信息:
> pData[1:6,1:4] Source.Name Material.Type Characteristics.Organism. Characteristics.CellLine. 1 380 cell Homo sapiens 380 2 697 cell Homo sapiens 697 3 5637 cell Homo sapiens 5637 4 22RV1 cell Homo sapiens 22RV1 5 23132-87 cell Homo sapiens 23132-87 6 639-V cell Homo sapiens 639-V colnames(pData) [1] "Source.Name" "Material.Type" "Characteristics.Organism." [4] "Characteristics.CellLine." "Characteristics.DiseaseState." "Protocol.REF" [7] "Protocol.REF.1" "Extract.Name" "Protocol.REF.2" [10] "Labeled.Extract.Name" "Label" "Protocol.REF.3" [13] "Term.Source.REF" "Hybridization.Name" "Array.Design.REF" [16] "Array.Data.File" "Comment..ArrayExpress.FTP.file." "Factor.Value.CELL_LINE."
以及789个细胞系的芯片表达矩阵:
> dim(gdsc_brainarray_syms) [1] 12092 789 > gdsc_brainarray_syms[1:6,1:3] 5500024030401071707289.A01.CEL 5500024030401071707289.A02.CEL 5500024030401071707289.A03.CEL NAT2 4.012023 4.144823 4.445793 ADA 6.153445 7.890772 7.160563 CDH2 5.270275 7.594583 5.327693 AKT3 5.766602 5.632153 4.687145 MED6 7.750406 7.395117 7.311555 NR2E3 4.692599 4.326632 4.214958 >
直接根据全表达矩阵做主成分分析
总共的666个细胞系
pcsAll <- prcomp(t(trainDataOrd))$x colSup1 <- unclass(factor(cancerTypesOrd)) pchSup1 <- unclass(factor(cancerTypesOrd)) #pdf("suppFig1.pdf", width=12, height=10) par(mar=c(5.1, 4.1, 4.1, 5.7), xpd=TRUE) plot(pcsAll, pch=pchSup1, col=colSup1, xlab="Principle component 1", ylab="Principle component 2") legend("topleft", sapply(as.character(levels(factor(cancerTypesOrd))), makeCapital), col=seq(1:16), pch=seq(1:16), cex=.6, inset=c(1,0)) #dev.off()
抽取血液相关细胞系做主成分分析
总共是88个细胞系,列表如下:
haemTissue Freq 1 AML 15 2 B cell leukemia 7 3 B cell lymphoma 9 4 Burkitt lymphoma 9 5 CML 5 6 Hodgkin lymphoma 4 7 Myeloma 7 8 haematopoietic_neoplasm other 1 9 hairy_cell_leukaemia 1 10 leukemia 2 11 lymphoblastic T cell leukaemia 9 12 lymphoblastic leukemia 10 13 lymphoid_neoplasm other 9
绘图代码是:
pcsHaem <- prcomp(t(trainDataOrd[, which(cancerTypesOrd == "blood")]))$x haemIndex <- which(cancerTypesOrd == "blood") haemTissue <- gdsc_mutDat[commonCellLines[haemIndex], "Tissue"] mycol <- unclass(factor(haemTissue)) mypch <- unclass(factor(haemTissue)) #pdf("suppFig2.pdf", width=12, height=10) par(mar=c(5.1, 4.1, 4.1, 7.4), xpd=TRUE) plot(pcsHaem[,c(1,2)], col=mycol, pch=mypch, xlab="Principle component 1", ylab="Principle component 2") legend("topleft", sapply(as.character(levels(factor(haemTissue))), makeCapital), col=seq(1:13), pch=seq(1:13), cex=.5, inset=c(1,0)) #dev.off()
还可以根据mut信息及CNV信息来抽取样本进行PCA分析并且着色看看该分类是否与主成分分析结果对应。