Over 20,000 RNA-seq samples from 3 major databases, uniformly reprocessed

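The RNA-seq data produced by large consortium projects is already abundant, but anyone who wants to analyze several of these databases together has to deal with batch effects. The UCSC team therefore reprocessed all of the data with one uniform pipeline, running on Amazon's cloud at a cost of 1.3 USD per sample.
Published in Nature Biotechnology: https://doi.org/10.1038/nbt.3772
The 3 databases are: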

The Cancer Genome Atlas (TCGA)
Genotype-Tissue Expression (GTEx)
Therapeutically Applicable Research To Generate Effective Treatments (TARGET)
A web tool is also provided for querying the data:

Differential gene and isoform expression of FOXM1 transcription factor in TCGA vs. GTEx

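Data processing pipeline

As shown in the figure: CutAdapt was used for adapter trimming, STAR was used for alignment, and RSEM and Kallisto were used as quantifiers.

Pipeline overview

If you have questions about the RNA-seq data processing pipeline, go straight to my 74-hour full introductory bioinformatics video series: the 生信技能树 video course learning path. Videos this good, and they are free!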

Choice of reference genome

STAR, RSEM, and Kallisto indexes were all built with the same reference genome: HG38 (no alt analysis) with overlapping genes from the PAR locus removed (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415).
- ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines
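The PAR filtering described above could be sketched roughly as follows. This is only an illustration and not the pipeline's actual code: the helper name `drop_par_genes` is hypothetical, the input is assumed to be a standard 9-column GTF on standard input, and "overlapping" is interpreted here as any chrY feature that overlaps either of the two quoted intervals.

```shell
#!/usr/bin/env bash
# Sketch: drop GTF features that overlap the two chrY PAR intervals
# quoted in the text. drop_par_genes is a hypothetical helper name,
# not part of the published pipeline.
drop_par_genes() {
  awk -F'\t' '
    /^#/ { print; next }                       # keep GTF comment lines
    $1 == "chrY" &&
      (($4 <= 2781479  && $5 >= 10000) ||      # overlaps chrY:10,000-2,781,479
       ($4 <= 57217415 && $5 >= 56887902)) {   # overlaps chrY:56,887,902-57,217,415
      next
    }
    { print }
  '
}

# Made-up example records: one inside the first interval, one outside
# both intervals, and one on chrX with the same coordinates.
printf 'chrY\ttest\tgene\t20000\t30000\t.\t+\t.\tgene_id "IN_PAR1";\nchrY\ttest\tgene\t5000000\t5001000\t.\t+\t.\tgene_id "OUTSIDE";\nchrX\ttest\tgene\t20000\t30000\t.\t+\t.\tgene_id "ON_X";\n' | drop_par_genes
```

With a real annotation, usage would look like `drop_par_genes < gencode.v23.annotation.gtf > filtered.gtf`, where the output file name is again just a placeholder.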
  
Choice of annotation files
RSEM: Gencode V23 comprehensive annotation (CHR)
- http://www.gencodegenes.org/releases/23.html first row
Kallisto: Gencode V23 comprehensive annotation (ALL)
- http://www.gencodegenes.org/releases/23.html second row
  
Choice of software parameters
STAR
- sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/star --runThreadN 32 --runMode genomeGenerate --genomeDir /data/genomeDir --genomeFastaFiles hg38.fa --sjdbGTFfile gencode.v23.annotation.gtf
Kallisto
- sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta
- Kallisto index that was used during the recompute is available here.
RSEM
- sudo docker run -v $(pwd):/data --entrypoint=rsem-prepare-reference jvivian/rsem -p 4 --gtf gencode.v23.annotation.gtf hg38.fa hg38
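As you can see, the 3 elements above (reference genome, annotation files, and software parameters) are the same basic pattern I followed when writing tutorials on the 生信菜鸟团 blog five years ago.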
  
  Raw data
  
  Nature Publication Supplementary Note 7 – Data Availability
  
  Submitter sample ID to Xena sample ID mapping
  
  TCGA mapping
  
  GTEx mapping
  
  TARGET mapping
  
Final released datasets available for download
GTEX (11 datasets)
TARGET Pan-Cancer (PANCAN) (12 datasets)
TCGA and TARGET Pan-Cancer (PANCAN) (4 datasets)
TCGA Pan-Cancer (PANCAN) (10 datasets)
TCGA TARGET GTEx (13 datasets)

Of these, the TCGA TARGET GTEx cohort, which combines all 3 databases, has 13 datasets in total.

cohort: TCGA TARGET GTEx

The expression matrices have impressive sample sizes
- RSEM expected_count (n=19,109), UCSC Toil RNAseq Recompute
- RSEM expected_count (DESeq2 standardized) (n=19,039), UCSC Toil RNAseq Recompute: RSEM expected_count output normalized using DESeq2
- RSEM fpkm (n=19,131), UCSC Toil RNAseq Recompute
- RSEM norm_count (n=19,120), UCSC Toil RNAseq Recompute: TCGA TARGET GTEx gene expression by UCSC TOIL RNA-seq recompute
- RSEM tpm (n=19,131), UCSC Toil RNAseq Recompute
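
Once one of the gene expression matrices listed above has been downloaded, pulling out a single gene such as FOXM1 might look like the sketch below. It rests on assumptions that are not stated in the post: the matrix is taken to be tab-delimited with sample IDs in the first row and gene IDs in the first column, and `extract_gene` is a hypothetical helper name. The demo uses made-up IDs and values rather than real data.

```shell
#!/usr/bin/env bash
# Sketch: print the header row plus the row(s) for one gene from a
# tab-delimited expression matrix read on standard input.
# extract_gene is a hypothetical helper; the layout (sample IDs in row 1,
# gene IDs in column 1) is an assumption about the downloaded file.
extract_gene() {
  local gene_id="$1"
  awk -F'\t' -v id="$gene_id" '
    NR == 1 { print; next }            # always keep the sample-ID header
    index($1, id) == 1 { print }       # gene ID starts with the given prefix
  '
}

# Made-up matrix with two genes and two samples.
printf 'sample\tS1\tS2\nGENE_A.1\t1.5\t2.5\nGENE_B.3\t0.0\t4.0\n' | extract_gene GENE_B
```

Prefix matching is used so that a versioned ID can be found from its unversioned stem. The trade-off is that a short prefix can match more than one row, so pass the full stem of the ID.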

phenotype
- TCGA GTEX main categories (n=17,221), UCSC Toil RNAseq Recompute
- TCGA survival data (n=10,496), UCSC Toil RNAseq Recompute
- TCGA TARGET GTEX selected phenotypes (n=19,131), UCSC Toil RNAseq Recompute

somatic mutation (SNP and INDEL)
- TCGA somatic mutations (Pan-cancer Atlas MC3 public version) (n=8,463), UCSC Toil RNAseq Recompute

transcript expression RNAseq
- RSEM expected_count (n=19,109), UCSC Toil RNAseq Recompute: TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
- RSEM fpkm (n=19,129), UCSC Toil RNAseq Recompute: TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
- RSEM isoform percentage (n=19,131), UCSC Toil RNAseq Recompute: TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
- RSEM tpm (n=19,131), UCSC Toil RNAseq Recompute: TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
