Over 20,000 RNA-seq samples from 3 major databases reprocessed with one uniform pipeline

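The RNA-seq data produced by large consortium projects is already very rich, but anyone who wants to analyze several databases together has to face the problem of batch effects. So the UCSC team reprocessed all of these data with a single uniform pipeline, on the Amazon cloud, at a cost of 1.3 US dollars per sample.
Published in Nature Biotechnology: https://doi.org/10.1038/nbt.3772
The 3 major databases are: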

  1. The Cancer Genome Atlas (TCGA)
  2. Genotype-Tissue Expression (GTEx)
  3. Therapeutically Applicable Research To Generate Effective Treatments (TARGET)
    A web tool is also provided for querying the data:

    Differential gene and isoform expression of FOXM1 transcription factor in TCGA vs. GTEx

    Data processing pipeline used

    Pipeline figure: CutAdapt was used for adapter trimming, STAR was used for alignment, and RSEM and Kallisto were used as quantifiers.

    Pipeline overview

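    If you have questions about the RNA-seq data processing pipeline, go straight to my 74-hour full series of introductory bioinformatics videos: the 生信技能树 video course learning path. Videos this good, and they are free!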

    Choice of reference genome

  • STAR, RSEM, and Kallisto indexes were all built with the same reference genome. HG38 (no alt analysis) with overlapping genes from the PAR locus removed (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415).
    • ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines
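    The PAR filtering described above can be illustrated with a minimal sketch. It assumes the quoted coordinates are 1-based and inclusive, and that "overlapping" means any overlap between a gene interval and a PAR interval. The function name `overlaps_par` and the example genes are hypothetical and are not taken from the UCSC pipeline.

```python
# PAR intervals on chrY as quoted in the text above
# (assumed here to be 1-based, inclusive coordinates).
PAR_REGIONS = [
    ("chrY", 10_000, 2_781_479),
    ("chrY", 56_887_902, 57_217_415),
]


def overlaps_par(chrom, start, end, regions=PAR_REGIONS):
    """Return True if the closed interval [start, end] on chrom overlaps any PAR region."""
    return any(
        chrom == r_chrom and start <= r_end and end >= r_start
        for r_chrom, r_start, r_end in regions
    )


# Hypothetical gene coordinates, for illustration only.
genes = [
    ("geneA", "chrY", 2_500_000, 2_790_000),  # straddles the end of the first region
    ("geneB", "chrY", 3_000_000, 3_100_000),  # outside both regions
    ("geneC", "chrX", 20_000, 30_000),        # different chromosome
]

# Keep only genes that do not touch a PAR interval.
kept = [name for name, chrom, start, end in genes if not overlaps_par(chrom, start, end)]
```

    In this sketch a gene is dropped as soon as it shares a single base with either interval; whether the original recompute used exactly this rule is not stated in the text.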

      Choice of annotation files

  • RSEM: Gencode V23 comprehensive annotation (CHR)
    • http://www.gencodegenes.org/releases/23.html first row
  • Kallisto: Gencode V23 comprehensive annotation (ALL)
    • http://www.gencodegenes.org/releases/23.html second row

      Choice of software parameters

  • STAR
    • sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/star --runThreadN 32 --runMode genomeGenerate --genomeDir /data/genomeDir --genomeFastaFiles hg38.fa --sjdbGTFfile gencode.v23.annotation.gtf
  • Kallisto
    • sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta
    • Kallisto index that was used during the recompute is available here.
  • RSEM
  • GTEX (11 datasets)
  • TARGET Pan-Cancer (PANCAN) (12 datasets)
  • TCGA and TARGET Pan-Cancer (PANCAN) (4 datasets)
  • TCGA Pan-Cancer (PANCAN) (10 datasets)
  • TCGA TARGET GTEx (13 datasets)

    Of these, the TCGA TARGET GTEx cohort (the 3 major databases combined) has 13 datasets in total

    cohort: TCGA TARGET GTEx

    The expression matrices have impressive sample sizes

  • RSEM expected_count
    (n=19,109)
    UCSC Toil RNAseq Recompute
  • RSEM expected_count (DESeq2 standardized)
    (n=19,039)
    UCSC Toil RNAseq Recompute
    RSEM expected_count output normalized using DESeq2
  • RSEM fpkm
    (n=19,131)
    UCSC Toil RNAseq Recompute
  • RSEM norm_count
    (n=19,120)
    UCSC Toil RNAseq Recompute
    TCGA TARGET GTEx gene expression by UCSC TOIL RNA-seq recompute
  • RSEM tpm
    (n=19,131)
    UCSC Toil RNAseq Recompute
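    The fpkm and tpm matrices listed above are two scalings of the same abundance estimates: within one sample, TPM is FPKM rescaled so that the values sum to one million. A minimal sketch of that relationship follows. It assumes linear-scale (not log-transformed) values, so a downloaded matrix that has been log-transformed would need to be converted back first. The function name `fpkm_to_tpm` and the numbers are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's linear-scale FPKM values so that they sum to 1e6 (TPM)."""
    total = sum(fpkm)
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    return [value * 1e6 / total for value in fpkm]


# Hypothetical FPKM values for three genes in a single sample.
sample_fpkm = [10.0, 30.0, 60.0]
sample_tpm = fpkm_to_tpm(sample_fpkm)
```

    Because every sample's TPM values sum to the same constant, TPM is usually the easier of the two units to compare across samples.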

    phenotype

  • TCGA GTEX main categories
    (n=17,221)
    UCSC Toil RNAseq Recompute
  • TCGA survival data
    (n=10,496)
    UCSC Toil RNAseq Recompute
  • TCGA TARGET GTEX selected phenotypes
    (n=19,131)
    UCSC Toil RNAseq Recompute

    somatic mutation (SNP and INDEL)

  • TCGA somatic mutations (Pan-cancer Atlas MC3 public version)
    (n=8,463)
    UCSC Toil RNAseq Recompute

    transcript expression RNAseq

  • RSEM expected_count
    (n=19,109)
    UCSC Toil RNAseq Recompute
    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
  • RSEM fpkm
    (n=19,129)
    UCSC Toil RNAseq Recompute
    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
  • RSEM isoform percentage
    (n=19,131)
    UCSC Toil RNAseq Recompute
    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
  • RSEM tpm
    (n=19,131)
    UCSC Toil RNAseq Recompute
    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
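    The isoform percentage matrix can be read as each transcript's share of its parent gene's abundance. The sketch below shows that calculation from transcript-level TPM values. The function name, the transcript-to-gene mapping, and the numbers are all hypothetical, and returning 0.0 for genes with zero total abundance is a choice made in this sketch.

```python
from collections import defaultdict


def isoform_percentages(transcript_tpm, transcript_to_gene):
    """Return each transcript's TPM as a percentage of its gene's summed TPM."""
    gene_total = defaultdict(float)
    for tx, tpm in transcript_tpm.items():
        gene_total[transcript_to_gene[tx]] += tpm

    percentages = {}
    for tx, tpm in transcript_tpm.items():
        total = gene_total[transcript_to_gene[tx]]
        # Genes with no measured abundance get 0.0 for all of their transcripts.
        percentages[tx] = 100.0 * tpm / total if total > 0 else 0.0
    return percentages


# Hypothetical transcripts: tx1 and tx2 belong to geneA, tx3 to geneB.
tx_tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
tx_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
iso_pct = isoform_percentages(tx_tpm, tx_gene)
```

    With these numbers tx1 and tx2 split geneA 75 to 25, and tx3 accounts for all of geneB.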
