If only every RNA-seq project shared its data like this

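We recently published [Clearly Priced: Standard Upstream Transcriptome Analysis](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247499208&idx=1&sn=5ef47d5e0d2ebfe61e481e02963bdcef&scene=21#wechat_redirect), and almost immediately a reader came to us with a request: dataset GSE165752, for which they wanted us to run the upstream RNA-seq workflow and produce the expression matrix.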

But the authors have in fact already provided an expression matrix. The GEO record is here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE165752

The project description: C57BL/6N mice on one of two diets (control diet or 60% high-fat diet), each with or without DEN injection, followed by liver tissue RNA sequencing. Each condition has 4 replicates, so there are 4 × 2 × 2 = 16 samples, listed below:

```
GSM5049476 CD-Vehicle-1
GSM5049477 CD-Vehicle-2
GSM5049478 CD-Vehicle-3
GSM5049479 CD-Vehicle-4
GSM5049480 CD-DEN-1
GSM5049481 CD-DEN-2
GSM5049482 CD-DEN-3
GSM5049483 CD-DEN-4
GSM5049484 HFD-Vehicle-1
GSM5049485 HFD-Vehicle-2
GSM5049486 HFD-Vehicle-3
GSM5049487 HFD-Vehicle-4
GSM5049488 HFD-DEN-1
GSM5049489 HFD-DEN-2
GSM5049490 HFD-DEN-3
GSM5049491 HFD-DEN-4
```
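
As a quick sanity check on the 4 × 2 × 2 design, the sample titles can be split into their factors and counted per group. This is only a minimal sketch: the names `SAMPLE_TABLE`, `parse_title`, `design`, and `group_sizes` are made up for illustration, and the code assumes every title follows the `diet-treatment-replicate` pattern seen in the list above.

```python
from collections import Counter

# Sample list copied from the GEO series page for GSE165752.
SAMPLE_TABLE = """\
GSM5049476 CD-Vehicle-1
GSM5049477 CD-Vehicle-2
GSM5049478 CD-Vehicle-3
GSM5049479 CD-Vehicle-4
GSM5049480 CD-DEN-1
GSM5049481 CD-DEN-2
GSM5049482 CD-DEN-3
GSM5049483 CD-DEN-4
GSM5049484 HFD-Vehicle-1
GSM5049485 HFD-Vehicle-2
GSM5049486 HFD-Vehicle-3
GSM5049487 HFD-Vehicle-4
GSM5049488 HFD-DEN-1
GSM5049489 HFD-DEN-2
GSM5049490 HFD-DEN-3
GSM5049491 HFD-DEN-4
"""


def parse_title(title):
    """Split a title such as 'HFD-DEN-3' into (diet, treatment, replicate)."""
    diet, treatment, replicate = title.split("-")
    return diet, treatment, int(replicate)


# Map each GSM accession to its (diet, treatment, replicate) tuple.
design = {}
for line in SAMPLE_TABLE.splitlines():
    gsm, title = line.split()
    design[gsm] = parse_title(title)

# Count how many replicates fall into each diet x treatment group.
group_sizes = Counter((diet, treatment) for diet, treatment, _ in design.values())
for (diet, treatment), n in sorted(group_sizes.items()):
    print(f"{diet:>3} x {treatment:<7} : {n} replicates")
```

The same `design` dictionary could later serve as the grouping information for a differential expression analysis.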

If you know the GEO database well enough, you will spot that a raw counts expression matrix is available: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE165nnn/GSE165752/suppl/GSE165752_HFD_counts_2020.csv.gz
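
The FTP path above follows a regular pattern: the series directory replaces the last three digits of the accession with `nnn`. A small sketch of that rule, assuming it holds for the accession in question (the function name `geo_series_suppl_url` is hypothetical, and the supplementary file name still has to be read off the GEO page because it differs per dataset):

```python
def geo_series_suppl_url(accession, filename):
    """Build the FTP URL of a supplementary file for a GEO series.

    The directory stub drops the last three digits of the accession
    and appends 'nnn', e.g. GSE165752 -> GSE165nnn.
    """
    if not accession.startswith("GSE") or not accession[3:].isdigit():
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = accession[3:]
    stub = "GSE" + digits[:-3] + "nnn"
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/suppl/{filename}"
    )


# Reconstruct the link quoted above from its two variable parts.
url = geo_series_suppl_url("GSE165752", "GSE165752_HFD_counts_2020.csv.gz")
print(url)
```

The resulting URL can then be fetched with whatever download tool you prefer and the gzipped CSV opened in R or Python.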

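In my series of posts on mining public microarray expression databases, I explained in detail how to locate the key information for any dataset: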

- [Understanding how GEO stores data and how to download it: one post is enough](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486063&idx=1&sn=156bee5397e979722b36b78284188538&scene=21#wechat_redirect)
- [Understanding how the SRA database is organized: one post is enough](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486054&idx=1&sn=209975adee162228cfe6e6c5065c5c8c&scene=21#wechat_redirect)
- [Getting an expression matrix from GEO: one post is enough](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486087&idx=1&sn=1e775a1c3e215384e381953a9fa74ec3&scene=21#wechat_redirect)
- [GSEA analysis: one post is enough (desktop version + R version)](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486090&idx=1&sn=62374fbdd4f20c3185beb6568bbeb3e9&scene=21#wechat_redirect)
- [Differential analysis based on group information: one post is not enough for this](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486112&idx=1&sn=67a2104c62222bcb139623699f874a6c&scene=21#wechat_redirect)
- [Annotating differential analysis results: one post is enough](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486120&idx=1&sn=14d7892c1beec2fb9cdfc0ec0aba3e4e&scene=21#wechat_redirect)

### Data processing details

The paper also describes the processing of its own RNA-seq data very clearly, as follows:

- After cDNA synthesis, adapter ligation, and final cDNA library generation, samples were sequenced on a flow cell (1x50bp single-end reads) and HiSeq4000 (Illumina).
- Data processing was conducted in an NGS pipeline (Snakemake) and quality control was performed with FastQC.
- Trimmed data was analyzed for differential expression (DEseq2) and gene set enrichment analysis (GSEA) to look for KEGG pathways and gene ontologies (GO) of interest.

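So these are single-end 50 bp reads, quality-controlled with FastQC and then analyzed for differential expression with DESeq2, followed by straightforward annotation against functional databases such as KEGG and GO.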

### There are plenty of similarly ideal datasets

Another example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE165552

This one is also a 2 × 2 = 4 group design, with 2 replicates per group, for 8 samples in total, listed below:

```
GSM5037093 MP41 cells DMSO exp1
GSM5037095 MP41 cells DMSO exp2
GSM5037097 MP41 cells FR exp1
GSM5037098 MP41 cells FR exp2
GSM5037100 MP46 cells DMSO exp1
GSM5037102 MP46 cells DMSO exp2
GSM5037104 MP46 cells FR exp1
GSM5037105 MP46 cells FR exp2
```

This dataset accompanies a paper in *J Biol Chem* (available online 10 February 2021), "Targeting primary and metastatic uveal melanoma with a G protein inhibitor". Link: PMID: [33577798](https://www.ncbi.nlm.nih.gov/pubmed/33577798)

Data processing details:

- FastQ files were aligned to the transcriptome and the whole-genome with STAR.
- Biologic replicates were simultaneously analyzed by edgeR and Sailfish analyses of gene-level/exon-level features. Unexpressed genes and exons were removed from the analyses.
- Unsupervised principal component analysis was generated in Bioconductor using edgeR.
- Direct comparison of FR response in MP41 versus MP46 cells was used to identify MP41-specific genes and this list was used as signature gene sets for Gene Set Enrichment Analysis (GSEA)

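As you can see, the software is completely different from the previous dataset: alignment was done with STAR, and the differential analysis used edgeR and Sailfish.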

Background:

- FR900359 (FR) is a drug:
  - A novel therapeutic approach has been suggested by the discovery that UM cell lines driven by mutant constitutively active Gq or G11 can be targeted by FR900359 (FR) or YM-254890, which are bioavailable, selective inhibitors of the Gq/11/14 subfamily of heterotrimeric G proteins.
- MP41 and MP46 are two uveal melanoma (UM) cell line models:
  - We addressed this question initially by analyzing MP41 (class 1; BAP1+) and MP46 (class 2; BAP1-deficient) cell lines, which were established originally from patient-derived xenografts (PDX) of primary UM tumors

Here too, the expression data are provided for download, one file per sample:

```
GSM5037093_sample.MP41_d_1.txt.gz 1.3 Mb
GSM5037095_sample.MP41_d_2.txt.gz 1.3 Mb
GSM5037097_sample.MP41_fr_1.txt.gz 1.3 Mb
GSM5037098_sample.MP41_fr_2.txt.gz 1.3 Mb
GSM5037100_sample.MP46_d_1.txt.gz 1.3 Mb
GSM5037102_sample.MP46_d_2.txt.gz 1.3 Mb
GSM5037104_sample.MP46_fr_1.txt.gz 1.3 Mb
GSM5037105_sample.MP46_fr_2.txt.gz 1.4 Mb
```
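
Because this dataset ships one file per sample, the files have to be joined on the gene identifier before they form a single matrix. The sketch below shows one way to do that with the standard library only. It rests on an assumption that is not confirmed by the GEO record: that each file is a header-less, tab-separated table with the gene ID in the first column and the value in the second. Check the real files before relying on it. The helper names `read_sample` and `merge_samples`, and the tiny in-memory demo values, are invented for illustration.

```python
import csv
import io


def read_sample(handle):
    """Read a two-column, tab-separated (gene, value) table into a dict.

    ASSUMPTION: no header row, gene ID in column 1, value in column 2.
    """
    return {
        row[0]: row[1]
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def merge_samples(tables):
    """Join {sample: {gene: value}} into (header, rows).

    Only genes present in every sample are kept, sorted by gene ID.
    """
    samples = list(tables)
    shared = set.intersection(*(set(t) for t in tables.values()))
    header = ["gene"] + samples
    rows = [[gene] + [tables[s][gene] for s in samples] for gene in sorted(shared)]
    return header, rows


# Made-up stand-ins for two downloaded files; values are NOT real data.
demo = {
    "MP41_d_1": "GeneA\t10\nGeneB\t0\n",
    "MP41_fr_1": "GeneA\t7\nGeneB\t3\n",
}
tables = {name: read_sample(io.StringIO(text)) for name, text in demo.items()}
header, rows = merge_samples(tables)

print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

For the real gzipped files, the handle passed to `read_sample` could come from `gzip.open(path, "rt")` instead of `io.StringIO`.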

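For datasets like these, there is really no need to commission us. [Clearly Priced: Standard Upstream Transcriptome Analysis](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247499208&idx=1&sn=5ef47d5e0d2ebfe61e481e02963bdcef&scene=21#wechat_redirect) is aimed at the large majority of public datasets that do not come with an expression matrix. In those cases we [download the FASTQ sequencing data directly from the EBI database](https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247492889&idx=2&sn=bc2ef17a3b96a257fb692f73338c6b0f&scene=21#wechat_redirect) and then run the upstream workflow, which consumes our computing resources, so we charge a clearly stated fee of 800 RMB, only for those who actually need it!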
