后台有一些粉丝留言好奇怪,说怎么没看到我们分享多组学教程,我们会不会多组学联合分析啊!可能是因为看到我在B站的免费NGS数据处理视频课程合辑,都是单一组学数据吧!
- 免费视频课程《RNA-seq数据分析》
- 免费视频课程《WES数据分析》
- 免费视频课程《ChIP-seq数据分析》
- 免费视频课程《ATAC-seq数据分析》
- 免费视频课程《TCGA数据库分析实战》
- 免费视频课程《甲基化芯片数据分析》
- 免费视频课程《影像组学教学》
- 免费视频课程《LncRNA-seq数据》
- 免费视频课程《GEO数据挖掘》
能问出多组学分析会不会这样的问题的,肯定是初学者啦!
目前绝大部分所谓多组学其实仅仅是
其中一个组学用来对样本分组,然后看另外一个组学的数据在不同分组的差异,如果你仔细思考一下会分析,分组不就是样本的表型可以决定了吗?多组学就是样本的多个表型信息!!!
自己的数据量不够就公共数据库来凑
这篇文献仅仅是aCGH芯片只能拿到CNV信号,数据有点单薄,所以作者结合了CCLE数据库的公共数据的芯片表达矩阵做了一下多组学联合分析。是2014就发表的文章:Molecular Integrative Clustering of Asian Gastric Cell Lines Revealed Two Distinct Chemosensitivity Clusters , 该课题组自己做的是:array comparative genomic hybridization (aCGH) on 18 Asian gastric cell lines.
研究者对CCLE数据库的公共数据进行了八个步骤的处理,一个合格的生物信息学分析着完全可以重写这个过程:
-
step1:Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.
-
step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.
-
step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.
前三步是为了得到27个胃癌相关细胞系的 mRNA表达矩阵,方法是下载cel文件,用RMA归一化,对多探针基因去最大表达量探针,供后续分析使用!
- step4: Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.
- Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.
- step5: We then performed consensus clustering[17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k=2 as it gives sufficiently stable similarity matrix.
- step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) 18 with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.
也就是说,这里先用CNV信号数据来聚类,得到putative driver genes(就是CNV和表达量一起被改变的基因),然后再用这些基因的表达数据来再次聚类,分成两类,然后对这两类进行SAM找差异基因。
最后就是功能数据库注释啦,用来说明差异分析结果的意义!
- step7:ssGSEA (single sample GSEA)was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19], [20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0].
- step8:SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.
生物学结论:
- Cells in IC1 have enrichment of genes associated with oxidative phosphorylation and mitochondria functions.
- gastric cells in IC2 are enriched for genes involved in cell signalling
最后的结论同样是“无病呻吟” : In conclusion, combination of aCGH and gene expression analysis to identify potential candidate oncogenes or tumor suppressor genes is a powerful and proven approach that has been reported in other cancer studies.
当然了,这篇文章的工作量肯定不仅仅是两个组学数据的联合分析,还有一些药物试验数据和实验验证,感兴趣的小伙伴可以自行阅读哈。
学徒作业
从TCGA数据库里面定位到BRCA数据集,然后找到BRCA1基因突变的乳腺癌病人,以及BRCA1基因启动子区域高甲基化的乳腺癌病人,看看这两个分组是否有overlap。把病人分组后,看BRCA1基因突变的乳腺癌病人与BRCA1基因启动子区域高甲基化的乳腺癌病人他们的转录组数据的差异!
历年学徒作业目录如下:
- 生信编程直播课程优秀学员作业展示1
- 生信编程直播课程优秀学员学习心得及作业展示3
- 生信编程直播课程优秀学员作业展示2
- 给学徒的GEO作业
- 这个WGCNA作业终于有学徒完成了!
- 上次说的gmt函数(学徒作业)
- 拖后腿学徒居然也完成作业,理解RNA-seq数据分析结果
- 肿瘤外显子视频课程小作业
- ChIPseq视频课程小作业
- Agilent芯片表达矩阵处理(学徒作业)
- 学徒作业:TCGA数据库单基因gsea之COAD-READ
- 学徒作业-在CCLE数据库里面根据指定基因在指定细胞系里面提取表达矩阵
- 学徒作业-指定基因在指定组织里面的表达量热图
- 学徒作业-我想看为什么这几个基因的表达量相关性非常高
- 学徒作业:给你8个甲基化探针, 你在tcga数据库进行任意探索
- 学徒作业-根据我的甲基化视频教程来完成2015-NPC-methy-GSE52068研究
- RNA芯片和测序技术的比较(学徒作业)
- 学徒作业-单基因的tcga数据挖掘分析
- ATCC终于出来了organoids资源
- 拿到7个DDR通路的基因集-学徒作业
- 绘图本身很简单但是获取数据很难
- 都说lncRNA只有部分具有polyA尾结构,请证明
- 学徒作业-hisat2+stringtie+ballgown流程
- 学徒任务-探索DNA甲基化的组织特异性
- 用WES和RNA-Seq数据提取到的somatic SNVs不一致
- 《GEO数据挖掘课程》配套练习题