这个教程其实就是简单运行一个例子,建议大家看原作者的英文文档
如果你还没有安装,就运行下面的代码安装:
source('http://bioconductor.org/biocLite.R');
biocLite('DEXSeq');
biocLite('pasilla')
如果你安装好了,就直接加载它们即可
suppressPackageStartupMessages(library("DEXSeq"))
suppressPackageStartupMessages(library("pasilla"))
inDir <- system.file("extdata", package="pasilla")
countFiles <- list.files(inDir, pattern="fb.txt$", full.names=TRUE)
countFiles
## [1] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/treated1fb.txt"
## [2] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/treated2fb.txt"
## [3] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/treated3fb.txt"
## [4] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/untreated1fb.txt"
## [5] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/untreated2fb.txt"
## [6] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/untreated3fb.txt"
## [7] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/untreated4fb.txt"
## 共七个样本,可以从文件名看到样本描述
head(read.table(countFiles[1]))
## V1 V2
## 1 FBgn0000003:001 0
## 2 FBgn0000008:001 0
## 3 FBgn0000008:002 0
## 4 FBgn0000008:003 0
## 5 FBgn0000008:004 1
## 6 FBgn0000008:005 4
## 简单看一下每个样本的数据要求,其实就是每个基因的每个外显子的reads数量
## 这个文件非常容易制作,所以我就没有仔细讲了,包自带python脚本可以,HTseq-count软件也可以
gffFile <- list.files(inDir, pattern="gff$", full.names=TRUE)
gffFile
## [1] "C:/Users/jimmy1314/Documents/R/win-library/3.3/pasilla/extdata/Dmel.BDGP5.25.62.DEXSeq.chr.gff"
##注意,如果是自己的数据的话,比如之前示例使用的是DEXSeq.hg19.gene.gff,这里就用DEXSeq.hg19.gene.gff
##实验设计
sampleTable <- data.frame(row.names=c(paste("treated", 1:3, sep=""), paste("untreated", 1:4, sep="")),
condition=rep(c("knockdown", "control"), c(3, 4)))
sampleTable
## condition
## treated1 knockdown
## treated2 knockdown
## treated3 knockdown
## untreated1 control
## untreated2 control
## untreated3 control
## untreated4 control
## 对自己的数据,完全模仿这3个文件即可。
## 这个例子是简单分组,就一个变量两个水平而已。
就是利用上面准备好的3类文件即可
dxd <- DEXSeqDataSetFromHTSeq(
countFiles,
sampleData=sampleTable,
design= ~sample + exon + condition:exon,
flattenedfile=gffFile
)
## converting counts to integer mode
dxd
## class: DEXSeqDataSet
## dim: 70463 14
## metadata(1): version
## assays(1): counts
## rownames(70463): FBgn0000003:E001 FBgn0000008:E001 ...
## FBgn0261575:E001 FBgn0261575:E002
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
## transcripts
## colnames: NULL
## colData names(3): sample condition exon
class(dxd)
## [1] "DEXSeqDataSet"
## attr(,"package")
## [1] "DEXSeq"
## 对自己的数据,完全模仿这个形式即可
dxr <- DEXSeq(dxd)
## Warning in MulticoreParam(workers = 1): MulticoreParam() not supported on
## Windows, use SnowParam()
## using supplied model matrix
## using supplied model matrix
## using supplied model matrix
## 这一步非常耗时的,就是一步法做差异分析,文末会给出分步法做差异分析。
dxr; class(dxr)
##
## LRT p-value: full vs reduced
##
## DataFrame with 70463 rows and 13 columns
## groupID featureID exonBaseMean dispersion
## <character> <character> <numeric> <numeric>
## FBgn0000003:E001 FBgn0000003 E001 0.1561875 NA
## FBgn0000008:E001 FBgn0000008 E001 0.0000000 NA
## FBgn0000008:E002 FBgn0000008 E002 0.1876242 0.5297885
## FBgn0000008:E003 FBgn0000008 E003 0.5927352 0.1518155
## FBgn0000008:E004 FBgn0000008 E004 0.5210598 0.1654008
## ... ... ... ... ...
## FBgn0261574:E018 FBgn0261574 E018 47.3948427 0.03858804
## FBgn0261574:E019 FBgn0261574 E019 32.2179726 0.01347702
## FBgn0261574:E020 FBgn0261574 E020 1.0608305 0.76404513
## FBgn0261575:E001 FBgn0261575 E001 0.2829753 0.49345042
## FBgn0261575:E002 FBgn0261575 E002 3.1194212 0.09138015
## stat pvalue padj control knockdown
## <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0000003:E001 NA NA NA NA NA
## FBgn0000008:E001 NA NA NA NA NA
## FBgn0000008:E002 0.27117898 0.6025420 NA 0.920173 0.02585977
## FBgn0000008:E003 0.48663541 0.4854320 NA 1.556206 1.02678796
## FBgn0000008:E004 0.04775949 0.8270088 NA 1.268759 1.43519216
## ... ... ... ... ... ...
## FBgn0261574:E018 0.35346198 0.55215990 1 11.0460832 10.208065
## FBgn0261574:E019 3.36550262 0.06657527 1 8.6145567 10.027165
## FBgn0261574:E020 0.49763865 0.48053952 1 2.1691485 1.041570
## FBgn0261575:E001 -0.00166349 1.00000000 NA 0.9660491 1.263417
## FBgn0261575:E002 0.03334011 0.85511758 1 3.4424481 3.340774
## log2fold_knockdown_control genomicData
## <numeric> <GRanges>
## FBgn0000003:E001 NA chr3R:2648220-2648518:+
## FBgn0000008:E001 NA chr2R:18024494-18024495:+
## FBgn0000008:E002 -10.310117 chr2R:18024496-18024531:+
## FBgn0000008:E003 -1.206037 chr2R:18024532-18024713:+
## FBgn0000008:E004 0.357708 chr2R:18024938-18025504:+
## ... ... ...
## FBgn0261574:E018 -0.3038143 chr3L:20017256-20017387:+
## FBgn0261574:E019 0.5523911 chr3L:20017449-20017613:+
## FBgn0261574:E020 -2.1332654 chr3L:20017614-20017707:+
## FBgn0261575:E001 0.7773539 chr3R:21170483-21170777:-
## FBgn0261575:E002 -0.0896365 chr3R:21170835-21172634:-
## countData transcripts
## <matrix> <list>
## FBgn0000003:E001 0:0:1:... FBtr0081624
## FBgn0000008:E001 0:0:0:... FBtr0100521
## FBgn0000008:E002 0:0:0:... FBtr0071763,FBtr0100521
## FBgn0000008:E003 0:1:0:... FBtr0071763
## FBgn0000008:E004 1:0:1:... FBtr0071764
## ... ... ...
## FBgn0261574:E018 59:38:39:... FBtr0100509,FBtr0074906
## FBgn0261574:E019 87:25:24:... FBtr0100509,FBtr0074906
## FBgn0261574:E020 0:0:1:... FBtr0100509
## FBgn0261575:E001 2:0:0:... FBtr0084847
## FBgn0261575:E002 10:1:3:... FBtr0084847
## [1] "DEXSeqResults"
## attr(,"package")
## [1] "DEXSeq"
plotMA(dxr)
mcols(dxr)$description
## [1] "group/gene identifier"
## [2] "feature/exon identifier"
## [3] "mean of the counts across samples in each feature/exon"
## [4] "exon dispersion estimate"
## [5] "LRT statistic: full vs reduced"
## [6] "LRT p-value: full vs reduced"
## [7] "BH adjusted p-values"
## [8] "exon usage coefficient"
## [9] "exon usage coefficient"
## [10] "relative exon usage fold change"
## [11] "GRanges object of the coordinates of the exon/feature"
## [12] "matrix of integer counts, of each column containing a sample"
## [13] "list of transcripts overlapping with the exon"
## 差异分析结果的每一列的介绍
table ( dxr$padj < 0.1 ) # how many exonic regions are significant
##
## FALSE TRUE
## 42984 234
table ( tapply( dxr$padj < 0.1, dxr$groupID, any ) ) # how many genes are affected
##
## FALSE TRUE
## 5220 166
这些图片很好理解,我就不多说啦
plotDEXSeq( dxr, "FBgn0000210", legend=TRUE, cex.axis=1.2, cex=1.3,
lwd=2 )
## visualize the transcript models
plotDEXSeq( dxr, "FBgn0000210", displayTranscripts=TRUE, legend=TRUE,
cex.axis=1.2, cex=1.3, lwd=2 )
# look at the count values from the individual samples,
plotDEXSeq( dxr, "FBgn0000210", expression=FALSE, norCounts=TRUE,
legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
# remove overall changes in expression from the plots. Use the (somewhat misnamed) option
# splicing=TRUE for this purpose.
plotDEXSeq( dxr, "FBgn0000210", expression=FALSE, splicing=TRUE,
legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
产生约75M的一个文件夹,里面有网页超链接,把所有的显著性差异的基因的相关图片全部画了一般。
# create a result table with links to plots for the significant results
DEXSeqHTML( dxr, FDR=0.1, color=c("#FF000080", "#0000FF80") )
自己看文档,这个需求不多
自己看文档,讲起来比较麻烦,尤其是多个因子多个水平
下面的代码不需要运行,等价于上面的 dxr <- DEXSeq(dxd)
if(F){
dxd <- estimateSizeFactors(dxd) #第一步
dxd <- estimateDispersions(dxd) #第二步,此时可以使用plotDispEsts(dxd)来观察离散情况
dxd <- testForDEU(dxd) #第三步,The function testForDEU performs these tests for each exon in each gene.
dxd <- estimateExonFoldChanges(dxd, fitExptoVar="condition")
dxr1 <- DEXSeqResults(dxd) #可以使用plotMA(dxr1)来查看结果
}