但很多时候,数据分析者往往身不由己。
Stanford 大学 在MOOC上面的公开课:PH525x series - Biomedical Data Science 还专门抽一个章节来讲解这个问题,足以见它的重要性。 http://genomicsclass.github.io/book/
Chapter 10 - Batch Effects
- Introduction to batch effects [Rmd]
- Confounding [Rmd]
- Confounding exercises
- EDA with PCA [Rmd]
- EDA with PCA exercises
- Adjusting with linear models [Rmd]
- Adjusting with linear models exercises
- Factor analysis [Rmd]
- Factor analysis exercises
- Adjusting with factor analysis [Rmd]
- Adjusting with factor analysis exercises
最简单的是quantile
library("preprocessCore")
dataMat <- cbind(trainExprMat, testExprMat)
dataMatNorm <- normalize.quantiles(dataMat)
whichbatch <- as.factor(c(rep("train", ncol(trainExprMat)), rep("test", ncol(testExprMat))))
train=dataMatNorm[, whichbatch=="train"]
test=dataMatNorm[, whichbatch=="test"]
很明显,画一下校正前后的 boxplot 就可以看到效果,然后PCA一下,看看是不是两个矩阵很清晰的被分开,如果是,说明校正失败咯。
高级一点是SVA包的ComBat函数
library("sva")
# subset to common genes andbatch correct using ComBat
dataMat <- cbind(trainExprMat, testExprMat)
mod <- data.frame("(Intercept)"=rep(1, ncol(dataMat)))
rownames(mod) <- colnames(dataMat)
whichbatch <- as.factor(c(rep("train", ncol(trainExprMat)), rep("test", ncol(testExprMat))))
combatout <- ComBat(dataMat, whichbatch, mod=mod)
train=combatout[, whichbatch=="train"]
test=combatout[, whichbatch=="test"]
还有limma包也附带了这个功能,就不多介绍了,感兴趣的朋友可以自行在生信技能树论坛搜索看看。