解决SCENIC流程的多线程问题

去年我们在《生信技能树》公众号带领大家一起学习过:SCENIC转录因子分析结果的解读 ,而且前些天我给了一个 [单细胞转录因子分析之SCENIC流程](https://mp.weixin.qq.com/s/pN4qWdUszuGqr8nOJstn8w):,但是单细胞各种交流群很多朋友反映自己follow我们的教程,出现了如下的报错:

Error in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores, : 
 Valid 'mctype': 'snow' or 'doMC'

一般来说,遇到了这样的报错说明其实并没有follow我的教程,而是follow的官方教程。因为我自己就是跟着官方文档跑的时候报错了,才进行了一些小的修改,主要是多线程问题,让我一一道来。

因为初始化的时候设置多线程

### Initialize settings
library(SCENIC)
db='cisTarget_databases/'
list.files(db)
# 保证cisTarget_databases 文件夹下面有下载好2个1G的文件
scenicOptions <- initializeScenic(org="hgnc", 
 dbDir=db , nCores=4) 
saveRDS(scenicOptions, file="int/scenicOptions.Rds") 
saveRDS(cellInfo, file="int/cellInfo.Rds")

可以看到前面的 nCores=4 参数,告诉我们的电脑,可以开启4个线程。

runGenie3可以正常多线程

开启多线程,速度超级快!

> runGenie3(exprMat_filtered_log, scenicOptions)
Using 1419 TFs as potential regulators...
Running GENIE3 part 1
Running GENIE3 part 10
Running GENIE3 part 2
Running GENIE3 part 3
Running GENIE3 part 4
Running GENIE3 part 5
Running GENIE3 part 6
Running GENIE3 part 7
Running GENIE3 part 8
Running GENIE3 part 9
Finished running GENIE3.

本来呢,2000多个细胞走这个步骤,需要24小时左右,但是加上了4个线程,一天一夜的时间消耗就变成了一个晚上。

但是runSCENIC_3_scoreCells多线程失败

虽然跑前面的runGenie3可以正常多线程,节省了大量的时间,但是后面runSCENIC_3_scoreCells多线程失败。

报错如下:

> scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log ) 
20:54 Step 3. Analyzing the network activity in each individual cell
 Number of regulons to evaluate on cells: 81
Biggest (non-extended) regulons: 
 SOX4 (252g)
 STAT1 (141g)
 JUN (99g)
 TAF7 (99g)
 SPI1 (90g)
 RFX3 (78g)
 IRF9 (65g)
 FOS (54g)
 RFX2 (54g)
 YY1 (54g)
Quantiles for the number of genes detected by cell: 
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
 min 1% 5% 10% 50% 100% 
 607.00 700.72 1116.20 1721.60 8015.00 12772.00 
Using 4 cores.
Error in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores, : 
 Valid 'mctype': 'snow' or 'doMC'

应该是 snow 或者 doMC 这两个包的一些函数更新了,但是SCENIC流程的R代码里面并没有及时更新。(原则上我们应该是去看源代码,解决它,然后去SCENIC流程的R代码的GitHub上面提交一个issue,不过因为时间关系,我们就到此为止啦。)

而且实际上这个runSCENIC_3_scoreCells步骤并不是计算量很大,无需设置多线程,所以可以重新修改为 nCores=1,代码如下:

scenicOptions <- initializeScenic(org="hgnc", 
 dbDir=db , nCores=1)

然后就顺利运行了:

> scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log ) 
21:01 Step 3. Analyzing the network activity in each individual cell
 Number of regulons to evaluate on cells: 81
Biggest (non-extended) regulons: 
 SOX4 (252g)
 STAT1 (141g)
 JUN (99g)
 TAF7 (99g)
 SPI1 (90g)
 RFX3 (78g)
 IRF9 (65g)
 FOS (54g)
 RFX2 (54g)
 YY1 (54g)
Quantiles for the number of genes detected by cell: 
(Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
 min 1% 5% 10% 50% 100% 
 607.00 700.72 1116.20 1721.60 8015.00 12772.00 
21:02 Finished running AUCell.
21:02 Plotting heatmap...
21:02 Plotting t-SNEs...

是不是超级简单啊!

整理好的全部代码是:

我把前面的代码重新整理了一下,如果你有构建好的单细胞seurat对象,就可以直接走下面的代码:

rm(list = ls()) 
library(Seurat) 
load(file = 'Epithelial_recluster_sce.Rdata')
sce 
phe=sce@meta.data 
mat=sce@assays$RNA@counts
mat[1:4,1:4]
exprMat =as.matrix(mat) 
dim(exprMat)
exprMat[1:4,1:4] 
head(phe)
cellInfo <- phe[,c('seurat_clusters','nCount_RNA' ,'nFeature_RNA' )]
colnames(cellInfo)=c('CellType', 'nGene' ,'nUMI')
head(cellInfo)
table(cellInfo$CellType)
### Initialize settings
library(SCENIC)
db='cisTarget_databases/'
list.files(db)
# 保证cisTarget_databases 文件夹下面有下载好2个1G的文件
scenicOptions <- initializeScenic(org="hgnc", 
 dbDir=db , nCores=4) 
saveRDS(scenicOptions, file="int/scenicOptions.Rds") 
saveRDS(cellInfo, file="int/cellInfo.Rds")

### Co-expression network
genesKept <- geneFiltering(exprMat, scenicOptions)
exprMat_filtered <- exprMat[genesKept, ]
exprMat_filtered[1:4,1:4]
dim(exprMat_filtered)
runCorrelation(exprMat_filtered, scenicOptions)
exprMat_filtered_log <- log2(exprMat_filtered+1) 
runGenie3(exprMat_filtered_log, scenicOptions)

### Build and score the GRN
exprMat_log <- log2(exprMat+1)
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # Toy run settings
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions,
 coexMethod=c("top5perTarget")) # Toy run settings
library(doParallel)
scenicOptions <- initializeScenic(org="hgnc", 
 dbDir=db , nCores=1) 
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_log ) 
scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
tsneAUC(scenicOptions, aucType="AUC") # choose settings

export2loom(scenicOptions, exprMat)
saveRDS(scenicOptions, file="int/scenicOptions.Rds")

完美而且快速的走完了这个SCENIC流程。

我前面在教程 单细胞转录因子分析之SCENIC流程提到的两个解决方案,第一个是对细胞亚群里面的单细胞进行抽样,第二个是 Python (pySCENIC). 教程,开启多线程!其实在R里面也算是解决了一半,我目前还没有去测试,在R里面跟python里面,到底是速度有啥差异,如果都开启了多线程的话。

Comments are closed.