如果传统bulk转录组数据队列足够大也可以使用单细胞流程

昨天我在生信菜鸟团分享的学徒数据挖掘任务: 不一定正确的多分组差异分析结果热图展现提到了可以使用单细胞转录组数据分析流程来处理文献的数据集。

还给出了一些简单代码，就是看看样本聚类情况，然后留下作业给另外一个学徒，看单细胞R包Seurat的FindAllMarkers函数对7个亚型找到的marker基因，根据传统的bulk转录组差异分析策略的差异。

先看看单细胞转录组代码

这里我们的单细胞转录组数据分析方法，基本上遵循我的全网第一个单细胞课程（基础）满一千份销量就停止发售内容，就是一些R包的认知，包括 scater,monocle,Seurat,scran,M3Drop 需要熟练掌握它们的对象，：一些单细胞转录组R包的对象，分析流程也大同小异：

step1: 创建对象
step2: 质量控制
step3: 表达量的标准化和归一化
step4: 去除干扰因素(多个样本整合)
step5: 判断重要的基因
step6: 多种降维算法
step7: 可视化降维结果
step8: 多种聚类算法
step9: 聚类后找每个细胞亚群的标志基因
step10: 继续分类

代码如下：

library(Seurat)
# https://satijalab.org/seurat/mca.html
sce <- CreateSeuratObject(counts = sarc_choose, 
 meta.data =clinic,
 min.cells = 5, min.features = 1000, 
 project = "sce")
head(sce@meta.data) 
# 步骤 ScaleData 的耗时取决于电脑系统配置（保守估计大于一分钟）
sce <- ScaleData(object = sce, 
 vars.to.regress = c('nCount_RNA'), 
 model.use = 'linear', 
 use.umi = FALSE)
sce <- FindVariableFeatures(object = sce, 
 mean.function = ExpMean, 
 dispersion.function = LogVMR, 
 x.low.cutoff = 0.0125, 
 x.high.cutoff = 4, 
 y.cutoff = 0.5)
length(VariableFeatures(sce)) 
sce <- RunPCA(object = sce, pc.genes = VariableFeatures(sce))
# 下面只是展现不同降维算法而已，并不要求都使用一次。
sce <- RunICA(sce )
sce <- RunTSNE(sce )
#sce <- RunUMAP(sce,dims = 1:10)
#VizPCA( sce, pcs.use = 1:2)
DimPlot(object = sce, reduction = "pca") 
DimPlot(object = sce, reduction = "ica")
DimPlot(object = sce, reduction = "tsne",group.by = 'short.histo')

# 针对PCA降维后的表达矩阵进行聚类 FindNeighbors+FindClusters 两个步骤。
sce <- FindNeighbors(object = sce, dims = 1:20, verbose = FALSE) 
sce <- FindClusters(object = sce, resolution = 0.8,verbose = FALSE)
# 继续tSNE可视化
DimPlot(object = sce, reduction = "tsne",
 group.by = 'RNA_snn_res.0.8',label = T)
table(sce@meta.data$short.histo,sce@meta.data$RNA_snn_res.0.8)

Idents(sce)=sce@meta.data$short.histo

sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, min.pct = 0.25, 
 thresh.use = 0.25)
DT::datatable(sce.markers)
library(dplyr)
sce.markers=sce.markers[!is.infinite(sce.markers$avg_logFC),]
sce.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)
topN <- sce.markers %>% group_by(cluster) %>% top_n(50, avg_logFC)
DoHeatmap(sce,topN$gene)

热图如下：

可以看到R包Seurat的FindAllMarkers函数对7个亚型找到的marker基因基本上都是上调基因。

检查单细胞转录组和bulk差异分析结果重合情况

首先bulk差异分析策略见: 不一定正确的多分组差异分析结果热图展现，其实就是我们以前在生信技能树分享的一个策略：如果你的分组比较多，差异分析策略有哪些？把其中一个亚型跟所有的其它样本进行常规转录组差异分析而已。差异分析结果如下：

我们的单细胞转录组分析结果的，topN 这个变量，就存储着其差异分析结果：

现在要比较这两个表格！

单细胞FindAllMarkers并不是简单取差异最大的基因

通常，我们对传统bulk转录组差异分析结果，可以选取top的上下调基因进行热图可视化，如下：

但是，我们上面单细胞流程的R包Seurat的FindAllMarkers函数对ULMS亚型找到的marker基因，却并不满足这个传统bulk转录组差异分析统计学显著指标，比如logFC大于2，并且校正后的P值小于0.05

不过，也是可以看到这样的规律，就是挑选的基因可以显著的把绝大部分ULMS 亚型病人跟其他病人区分开来。

但是，这些基因，在limma的voom算法的差异分析结果里面，只有非常少的几个基因的阈值是达标的：

绝大部分基因的logFC机会没有啥意义。

其实FindAllMarkers找基因还有很多参数可以选择

有趣的是，就是没有limma的voom算法！

"wilcox" : Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test (default)
"bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)
"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (abs(AUC-0.5) * 2) ranked matrix of putative differentially expressed genes.
"t" : Identify differentially expressed genes between two groups of cells using the Student's t-test.
"negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets
"poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets
"LR" : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.
"MAST" : Identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. Utilizes the MAST package to run the DE testing.
"DESeq2" : Identifies differentially expressed genes between two groups of cells based on a model using DESeq2 which uses a negative binomial distribution (Love et al, Genome Biology, 2014).This test does not support pre-filtering of genes based on average difference (or percent detection rate) between cell groups. However, genes may be pre-filtered based on their minimum detection rate (min.pct) across both cell groups. To use this method, please install DESeq2, using the instructions at https://bioconductor.org/packages/release/bioc/html/DESeq2.html

感兴趣的朋友，可以一一尝试，当然了，还不如系统性学习下统计学知识哦

可是，统计学不比尝试这些参数要简单其实。

一	二	三	四	五	六	日
« 九
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30