One methylation array dataset, mined several times (apprentice assignment)

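In an earlier 生信技能树 (Bioinformatics Skill Tree) tutorial, "What, the GEO dataset you are interested in is not linked to its original publication?", I mentioned that a single GSE dataset can be linked to many papers if it has been mined more than once. I left the example blank at the time, and a sharp-eyed reader pointed that out. Writing tutorials can be very time-consuming, and I did not want to look up and organize material just for one tutorial; I trust readers could follow the point even without an example. As it happens, I recently came across a dataset that is linked to 3 papers. The data is at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66313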

The design is simple, a 450K methylation array: DCIS (n=40) and adjacent normal (n=15). One additional piece of information: Among 40 DCIS cases 13 later developed invasive disease

The conclusion of the study is clear: This work contributes to the understanding of epigenetic alterations that occur in DCIS and illustrates the potential of DNA methylation as markers of DCIS progression.

The papers are:

  • DNA methylation in ductal carcinoma in situ related with future development of invasive breast cancer. Clin Epigenetics 2015;7:75. PMID: 26213588
  • Concordance of DNA methylation profiles between breast core biopsy and surgical excision specimens containing ductal carcinoma in situ (DCIS). Exp Mol Pathol 2017 Aug;103(1):78-83. PMID: 28711544
  • Genome-wide characterization of cytosine-specific 5-hydroxymethylation in normal breast tissue. Epigenetics 2020 Apr;15(4):398-418. PMID: 31842685

Apprentice assignment (1)

Write a literature review for each of these 3 papers. For the expected format, see:

Example literature review

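Multiple GSE datasets, following the differential-gene route

For example, the 3 papers below are all lung cancer data-mining papers; each integrates multiple GSE datasets and follows the differential-gene route.

First you need the biological background on lung cancer. Histopathologically, lung cancer is usually divided into:

  • Non-small-cell lung cancer (NSCLC)
  • Small cell lung cancer (SCLC)

SCLC accounts for about 15%~20% of all lung cancers. Its onset is closely tied to smoking, and biologically it is poorly differentiated, highly malignant, fast to double, and highly invasive, with a poor prognosis: median survival is only about 7 months.

NSCLC can be further divided into LUSC and LUAD, that is, squamous cell carcinoma versus adenocarcinoma.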

The first paper: Front. Genet., 12 October 2018 | https://doi.org/10.3389/fgene.2018.00469
  • Title: Identification of Candidate Biomarkers Correlated With the Pathogenesis and Prognosis of Non-small Cell Lung Cancer via Integrated Bioinformatics Analysis
  • Includes 4 datasets: GSE18842, GSE19804, GSE43458, and GSE62113
  • Uses the limma package to find significant differentially expressed genes (DEGs)
  • Uses RobustRankAggreg (RRA) to integrate the differential analysis results from the multiple datasets
  • Annotates the differential analysis results with the GO and KEGG databases
  • Searches the STRING database for the PPI network of the DEG set
  • Uses Cytoscape and Molecular Complex Detection (MCODE) to find the hub genes of the PPI network: TOP2A, CCNB1, CCNA2, UBE2C, KIF20A, and IL-6
  • Uses the Gene Expression Profiling Interactive Analysis (GEPIA) web tool to check whether the hub genes show a pan-cancer effect
  • Uses online data with Kaplan Meier-plotter (KM) to test whether the hub genes can predict survival
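The second paper: Mol Med Rep. 2018 May; 17(5): 6379–6386.
  • Title: Identification of key differentially expressed genes associated with non-small cell lung cancer by bioinformatics analyses
  • Includes 4 datasets: GSE21933, GSE33532, GSE44077 and GSE74706
    • 21 tumor samples and 21 normal samples for GSE21933
    • 80 tumor samples and 20 normal samples for GSE33532
    • 65 tumor samples and 65 normal samples for GSE44077
    • 18 tumor samples and 18 normal samples for GSE74706
  • Runs differential analysis on each dataset separately and selects the significant DEGs, with the same threshold throughout (adjusted P-value < 0.05 and |log2 fold-change (FC)| > 1)
  • Takes the overlap of the differential analysis results from the 4 datasets and shows it as a Venn diagram
  • Annotates the differential analysis results with the GO and KEGG databases
  • Searches the STRING database for the PPI network of the DEG set
  • Calls hub genes with the threshold "DEGs with a degree score ≥19": CCNB1, CCNA2, CEP55, PBK and HMMR
  • Uses online data with Kaplan Meier-plotter (KM) to test whether the hub genes can predict survival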
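
The thresholding, overlap, and degree-cutoff steps of this second paper can be sketched in plain Python. This is only an illustration under stated assumptions: the per-dataset tables, the numeric values, the `GENE_X` placeholder, and the edge list are all made up, and the function names (`filter_degs`, `shared_degs`, `hub_genes`) are hypothetical helpers, not anything taken from the paper. A real analysis would start from actual differential-analysis tables and a STRING network.

```python
from collections import Counter


def filter_degs(table, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return genes with adjusted P < padj_cutoff and |log2FC| > lfc_cutoff."""
    return {
        gene
        for gene, (log2fc, padj) in table.items()
        if padj < padj_cutoff and abs(log2fc) > lfc_cutoff
    }


def shared_degs(tables):
    """Intersect the DEG sets of several datasets (the core of the Venn diagram)."""
    deg_sets = [filter_degs(t) for t in tables]
    return set.intersection(*deg_sets)


def hub_genes(edges, min_degree):
    """Count each node's degree in an undirected edge list and keep the hubs."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {gene for gene, d in degree.items() if d >= min_degree}


# Toy per-dataset results: gene -> (log2FC, adjusted P). All values are invented.
dataset_a = {"CCNB1": (2.1, 0.001), "HMMR": (1.4, 0.01), "GENE_X": (0.3, 0.2)}
dataset_b = {"CCNB1": (1.8, 0.003), "HMMR": (0.9, 0.02), "GENE_X": (1.5, 0.04)}

common = shared_degs([dataset_a, dataset_b])
print(sorted(common))

# Toy PPI edges. The paper's cutoff was degree >= 19; 2 is used here only
# because the toy network is tiny.
toy_edges = [("CCNB1", "CCNA2"), ("CCNB1", "PBK"), ("CCNA2", "PBK"), ("PBK", "HMMR")]
print(sorted(hub_genes(toy_edges, min_degree=2)))
```

Keeping the per-dataset filter separate from the intersection makes it easy to apply a different threshold, or to swap the plain overlap for a rank-based integration.
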
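The third paper: published 26 October 2018
  • Title: Transcriptomic and functional network features of lung squamous cell carcinoma through integrative analysis of GEO and TCGA data
  • Includes 7 datasets: GSE8569, GSE21933, GSE33479, GSE33532, GSE40275, GSE62113, GSE74706
  • For the GSE datasets, uniformly uses the limma package with the threshold (|Log2FC| > 2, adjusted p-value < 0.05) to select significantly differentially expressed genes
  • Merges the samples from all 7 datasets, removes batch effects with the ComBat function of the sva package, and reruns limma to select significantly differentially expressed genes
  • Selects differential genes from the TCGA data: 502 tumors and 49 adjacent non-tumor
  • Integrates GEO and TCGA to obtain 129 genes (91 up-regulated and 38 down-regulated)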
  • Runs the same downstream analysis as the first two papers to get hub genes; there are rather more this time, 15: CCNB2, PLK1, KIF2C, CENPA, CENPF, BUB1, BUB1B, BIRC5, CENPE, ZWINT, AURKB, CHEK1, EXO1, RAD51, and RFC4
  • For TCGA LUSC, uses GDCRNATools to select a total of 124 DElncRNAs (|Log2FC| > 2, FDR < 0.05) and 74 DEmiRNAs (|Log2FC| > 2, FDR < 0.05), then builds a ceRNA network
  • Displays the ceRNA network with Cytoscape: 25 lncRNAs, 14 miRNAs and 14 mRNAs in total

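Apprentice assignment (2)

Run 3 methylation array differential analyses. First, read my methylation tutorial series on 生信技能树.

Then watch the free video course I shared on Bilibili, "Methylation array (450K or 850K) data processing" (《甲基化芯片(450K或者850K)数据处理》). For details, see: free video course "Methylation array data analysis" (《甲基化芯片数据分析》)

The 3 methylation array differential analyses are: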

  • DCIS (n=40) VS adjacent normal (n=15)
  • 13 later developed invasive disease VS 27 other DCIS
  • 13 later developed invasive disease VS adjacent normal (n=15)
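
The three comparisons above all reuse the same 55 samples, so it helps to write the grouping down once and derive each contrast from it. The sketch below is a minimal illustration: the sample IDs and the group labels (`normal`, `dcis_progressed`, `dcis_other`) are placeholders I made up, not the real GSE66313 annotation, which would have to be read from the series metadata.

```python
# Hypothetical sample sheet: 15 adjacent normal, 13 DCIS that later developed
# invasive disease, and 27 other DCIS. IDs and labels are placeholders.
samples = (
    [(f"N{i:02d}", "normal") for i in range(1, 16)]
    + [(f"P{i:02d}", "dcis_progressed") for i in range(1, 14)]
    + [(f"D{i:02d}", "dcis_other") for i in range(1, 28)]
)

# Each comparison maps a name to (case groups, control groups).
contrasts = {
    "DCIS_vs_normal": ({"dcis_progressed", "dcis_other"}, {"normal"}),
    "progressed_vs_other_DCIS": ({"dcis_progressed"}, {"dcis_other"}),
    "progressed_vs_normal": ({"dcis_progressed"}, {"normal"}),
}


def split_samples(samples, case_groups, control_groups):
    """Return the sample IDs belonging to the case and control sides."""
    cases = [sid for sid, group in samples if group in case_groups]
    controls = [sid for sid, group in samples if group in control_groups]
    return cases, controls


for name, (case_groups, control_groups) in contrasts.items():
    cases, controls = split_samples(samples, case_groups, control_groups)
    print(f"{name}: {len(cases)} vs {len(controls)}")
# DCIS_vs_normal: 40 vs 15
# progressed_vs_other_DCIS: 13 vs 27
# progressed_vs_normal: 13 vs 15
```

Checking the printed group sizes against the study design (40 vs 15, 13 vs 27, 13 vs 15) is a cheap sanity check before running any differential analysis.
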
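Each methylation comparison needs the 5 standard figures.

The standard figures for methylation array differential analysis come from the reference paper "Inflammatory cytokines shape a changing DNA methylome in monocytes mirroring disease activity in rheumatoid arthritis":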

  • (D) DNA methylation heatmap of CD15-CD33+CD11b+ cells isolated from RA and HD peripheral blood. The heatmap includes all CpG-containing probes displaying significant methylation changes (FDR<0.05). A scale is shown at the bottom ranging from −4 (lower DNA methylation levels, blue) to +4 (higher methylation levels, red).
  • (E) Representation of selected gene ontology categories obtained from the analysis of differentially methylated CpG sites comparing HD and RA samples using Genomic Regions Enrichment of Annotations Tool.
  • (F) Beta values showing methylation of individual CpGs in both the hypermethylated and the hypomethylated CpG sets. A schematic representation of each gene is depicted. Arrows refer to TSS and transcription direction (in red the analysed CpGs location).
  • (G) Analysis of differentially variable positions identified with iEVORA algorithm. Significant DVPs are those with a t-test value <0.01 and an adjusted Bartlett test value <0.01.
  • (H) DNA methylation plot of selected genes displaying DNA methylation variability in HD and RA group of samples. Mann-Whitney tests were used to determine significance (p<0.05, p<0.01 and p<0.001; n.s., not significant). DVP, differentially variable position; FDR, false discovery rate; HD, healthy donor; n.s., not significant; RA, rheumatoid arthritis; TSS, transcriptional start site.
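
The beta values in panel (F), and the split into hypermethylated and hypomethylated CpG sets, can be illustrated with a small sketch. It assumes invented beta values, placeholder probe names, and a hypothetical |delta beta| cutoff of 0.2 that does not come from the paper or the dataset; the M-value transform is the usual log2(beta / (1 - beta)). A real analysis would add a statistical test and multiple-testing correction rather than rely on the mean difference alone.

```python
import math
from statistics import mean


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value (0..1) to an M-value."""
    beta = min(max(beta, eps), 1 - eps)  # clamp so the log never sees 0
    return math.log2(beta / (1 - beta))


def classify_cpgs(case_betas, control_betas, delta_cutoff=0.2):
    """Label each CpG by its mean beta difference (case minus control)."""
    labels = {}
    for cpg in case_betas:
        delta = mean(case_betas[cpg]) - mean(control_betas[cpg])
        if delta >= delta_cutoff:
            labels[cpg] = "hypermethylated"
        elif delta <= -delta_cutoff:
            labels[cpg] = "hypomethylated"
        else:
            labels[cpg] = "unchanged"
    return labels


# Invented beta values for three placeholder probes.
case = {
    "cg_A": [0.80, 0.85, 0.90],
    "cg_B": [0.20, 0.25, 0.15],
    "cg_C": [0.50, 0.55, 0.45],
}
control = {
    "cg_A": [0.40, 0.45, 0.50],
    "cg_B": [0.60, 0.65, 0.70],
    "cg_C": [0.50, 0.50, 0.50],
}

print(classify_cpgs(case, control))
print(round(beta_to_m(0.5), 3))
```

Beta values are easier to read on a plot, while M-values are often preferred as input to statistical tests, which is why both forms appear in typical methylation workflows.
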
