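I used to know only that CIBERSORT ships with the built-in LM22 gene set. CIBERSORT is a method published in Nature Methods in 2015, and the tool is available at http://cibersort.stanford.edu. The method directly spawned a whole series of data-mining papers, as you will see if you search with the keywords CIBERSORT + bioinformatics:
It is hard to pin down which paper first applied this data-mining recipe, but one appeared as early as 2016: Patterns of Immune Infiltration in Breast Cancer and Their Clinical Implications: A Gene-Expression-Based Retrospective Study. PLOS Medicine 13, e1002194. The authors used the CIBERSORT algorithm to infer the proportions of 22 immune cell types in 11,000 breast tumors (tissue transcriptomes from microarrays or RNA-seq, including GEO and TCGA).
Then 2018-2020 saw an explosion: lung, kidney, colorectal, and liver cancer each got a dozen or so nearly identical papers that take TCGA transcriptome data and report the proportions of the 22 immune cell types.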
- For example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7399742/
- Title: "Immune Cell Infiltration and Identifying Genes of Prognostic Value in the Papillary Renal Cell Carcinoma Microenvironment by Bioinformatics Analysis"
- Two datasets:
- Two algorithms:
  - Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data (ESTIMATE)
  - Cell-type Identification By Estimating Relative Subsets Of known RNA Transcripts (CIBERSORT)
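For readers who have not looked inside this kind of tool, here is a minimal sketch of the deconvolution idea: a bulk expression profile is modeled as a signature matrix (genes by cell types) times a vector of cell-type fractions, and the fractions are estimated under a non-negativity constraint. This is only an illustration. CIBERSORT itself uses nu-support vector regression with the 22-cell-type LM22 matrix, whereas the sketch below swaps in plain non-negative least squares solved by projected gradient descent, and the function name, the toy signature matrix, and the toy mixture are all made up for the example.

```python
def estimate_fractions(signature, mixture, n_iter=2000):
    """Estimate cell-type fractions from a bulk expression profile.

    signature: list of rows, one per gene, each row holding that gene's
               reference expression in every cell type (hypothetical values).
    mixture:   list with the bulk expression of the same genes.
    Simplified stand-in for CIBERSORT: non-negative least squares,
    not the nu-SVR that the real tool uses.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # 1 / (sum of squared entries) is never larger than 1 / (largest
    # eigenvalue of S^T S), so this step size keeps gradient descent stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual of the current fit, one value per gene.
        residual = [
            sum(signature[g][k] * weights[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        # Gradient of 0.5 * ||S w - m||^2 with respect to each weight.
        grad = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        weights = [max(0.0, w - step * gk) for w, gk in zip(weights, grad)]
    total = sum(weights)
    # Rescale to relative fractions that sum to 1.
    return [w / total for w in weights] if total > 0 else weights


# Toy signature matrix: 4 genes x 2 cell types (illustrative numbers only).
toy_signature = [
    [10.0, 1.0],
    [8.0, 2.0],
    [1.0, 9.0],
    [2.0, 7.0],
]
# Bulk profile built as 30% of cell type 0 plus 70% of cell type 1.
toy_mixture = [3.7, 3.8, 6.6, 5.5]

toy_fractions = estimate_fractions(toy_signature, toy_mixture)
print(toy_fractions)
```

On this toy input the estimate should land very close to the fractions used to build the mixture. With real data the answer depends heavily on the signature matrix and on preprocessing, which is part of why so many papers can rerun the same recipe.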
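Papers of this kind are essentially identical apart from the cancer type. In some cases even the cancer type is the same and only the dataset differs. In the extreme case the dataset is the same too, which leaves one speechless.
Not everyone has stopped thinking
For example, the paper "Clinical significance and immunogenomic landscape analyses of the immune cell signature based prognostic model for patients with breast cancer", published in Briefings in Bioinformatics in July 2021 (https://doi.org/10.1093/bib/bbaa311), does not confine itself to the LM22 gene set built into CIBERSORT. Instead, the authors drew on a large body of literature, as quoted below: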
The 184 immune cell signatures were collected from diverse resources through an extensive literature search on the website. Of them,
- 25 signatures were obtained from the work of Bindea et al. [26],
- 68 signatures were obtained from the work of Wolf et al. [27],
- 17 signatures were downloaded from the ImmPort database [28],
- 24 T cell signatures were downloaded from the work of Miao et al. [29]
- 22, 10 and 10 signatures were obtained from CIBERSORT [16], MCP-Counter (R package, version 1.1) [30] and ImSig (R package, version 1.0.0) [31], respectively.
More detailed information is listed in the supplementary material and Supplementary Table S1–Supplementary Table S4.
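Once you have a collection of signatures like this, the usual next step is to turn each gene list into one score per sample. The sketch below shows one simple way to do that: z-score every gene across samples, then average the z-scores of the genes in each signature. This post does not say how the paper scored its signatures, so treat the mean z-score here as an assumed stand-in rather than the authors' method. The function names, the signature names, and the expression values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_genes(expression):
    """Z-score each gene across samples.

    expression: dict mapping gene symbol -> list of values, one per sample.
    Genes with zero variance get a z-score of 0 for every sample.
    """
    scaled = {}
    for gene, values in expression.items():
        mu = mean(values)
        sd = pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def score_signatures(expression, signatures):
    """Return dict: signature name -> list of per-sample mean z-scores.

    Genes absent from the expression data are ignored; a signature with
    no measured genes is skipped entirely.
    """
    scaled = zscore_genes(expression)
    n_samples = len(next(iter(expression.values())))
    scores = {}
    for name, genes in signatures.items():
        present = [g for g in genes if g in scaled]
        if not present:
            continue
        scores[name] = [
            mean(scaled[g][i] for g in present) for i in range(n_samples)
        ]
    return scores


# Toy expression data for three samples (made-up numbers).
toy_expression = {
    "CD8A": [1.0, 2.0, 3.0],
    "GZMB": [2.0, 4.0, 6.0],
    "CD19": [5.0, 5.0, 5.0],
    "MS4A1": [9.0, 6.0, 3.0],
}
# Hypothetical signatures; real ones would come from the cited sources.
toy_signatures = {
    "toy_cytotoxic": ["CD8A", "GZMB", "PRF1"],
    "toy_b_cell": ["CD19", "MS4A1"],
    "toy_unmeasured": ["NOT_A_GENE"],
}

toy_scores = score_signatures(toy_expression, toy_signatures)
print(toy_scores)
```

In this toy data the cytotoxic score rises from the first sample to the third while the B cell score falls, and the signature with no measured genes is dropped. Scores like these, one row per signature, are what a prognostic model would then be built on.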
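If you are interested in these gene sets, you can read the cited papers yourself.
Of course, to a large extent, people put in this much work only because the low-hanging fruit of data mining has already been picked. Unless you put in real effort, most of your code and figures will only be publishable on a WeChat public account.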