下载TCGA所有癌症的maf文件做signature分析

下载TCGA所有癌症的maf文件做signature分析

sanger研究所已经做好了这个分析,但是值得我们重复一下,效果如下:

首先TCGA所有癌症的maf文件

maf格式的mutation记录文件在TCGA里面已经是level4的数据啦,所以是完全open的,可以随意下载,只需要去其GDC官网简单点击,选择即可。

主要步骤就是在https://portal.gdc.cancer.gov/repository里面点击过滤文件类型,选择maf格式,再过滤access权限,选择open即可,最后得到的132个文件就是我们需要的。

总共是2.19GB的文件,每个癌症种类都有4种maf文件,分别是用mutect,muse,vanscan,somaticsniper这4款软件call 到的somatic mutation文件。

下载方式这里我选择下载它们132个文件的manifest文件,然后用GDC提供的官方工具来下载!关于这个工具,我 在生信技能树论坛写过教程,就不多说了,自己去看哈,现在下载TCGA数据也是非常方便,首先是GDC网站及客户端 就是安装成功后,运行 ./gdc-client download -m manifest_xxx.txt j即可。这个manifest文件就是自己刚才创造并且下载的。

cd ~/institute/TCGA/GDC_NCBI/all 
~/biosoft/GDC/gdc-client download -m gdc_manifest.2017-08-25T02-57-11.281090.txt

但是这个工具,提供的电脑操作系统版本有限哦

If you are a user of CentOS 6 or RedHat Enterprise Release 6 and wish to use the Data Transfer Tool, contact the GDC Help Desk for assistance.

所以我是在MAC里面下载好了,再上传到我的服务器去的!

然后根据MAF文件制作signature

我是根据这篇文章Mutational Profile of Metastatic Breast Cancers: A Retrospective Analysis里面的方法来做的,他们的方法描述如下:

De novo mutational signature analysis was done using the Matlab Welcome Trust Sanger Institute’s signature framework. We used the deconstructSigs R package to determine the contribution of the known signatures that explain each sample mutational profile with more than 50 somatic mutations. We considered the 13 signatures (Signatures 1, 2, 3, 5, 6, 8, 10, 13, 17, 18, 20, 26, and 30) operative in breast cancer as defined in COSMIC (http://cancer.sanger.ac.uk/signatures/matrix.png). A signature was defined as operative or predominant if its contribution to the mutational pattern was respectively >25% (or >100 mutations) or >50%.

虽然我以前写过一个类似的教程,用SomaticSignatures包来解析maf突变数据获得mutation signature 这里还是再学习学习这个新的工具deconstructSigs R package吧。

这是个R包,所以直接在Rstudio里面安装即可,这里选取BRCA的somatic mutation的MAF文件做一下分析,看看四个软件找出的变异,是否在signature上面有差异。

install.packages('deconstructSigs')
# dependencies 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19' 
BiocInstaller::biocLite('BSgenome')
BiocInstaller::biocLite('BSgenome.Hsapiens.UCSC.hg19')
## https://github.com/raerose01/deconstructSigsfile1='TCGA.STAD.muse..somatic.maf.gz'
TCGA.STAD.muse=read.table(file1,sep = '\t',quote="",header = T)
TCGA.STAD.muse[1:5,1:15]
## data frame including 5 columns: sample.ID,chr,pos,ref,alt 
sample.mut.ref <- data.fram(Sample='TCGA.STAD.muse', 
                                chr = TCGA.STAD.muse[,5], 
                                pos = TCGA.STAD.muse[,6], 
                                ref = TCGA.STAD.muse[,11], 
                                alt = TCGA.STAD.muse[,13])
sigs.input <- mut.to.sigs.input(mut.ref = sample.mut.ref, 
                                sample.id = "Sample", 
                                chr = "chr", 
                                pos = "pos", 
                                ref = "ref", 
                                alt = "alt")
class(sigs.input)
​
sample_1 = whichSignatures(tumor.ref = sigs.input, 
                           signatures.ref = signatures.nature2013, 
                           sample.id = 'TCGA.STAD.muse', 
                           contexts.needed = TRUE,
                           tri.counts.method = 'exome')
                           
# Plot example                  
plot_example <- whichSignatures(tumor.ref = sigs.input  ,  
                       signatures.ref = signatures.nature2013, 
                       sample.id = 'TCGA.STAD.muse' )
​
# Plot output
plotSignatures(plot_example, sub = 'example')

这个里面有一个问题,就是deconstructSigs R package似乎只支持hg19版本的基因组,而我下载的TCGA的MAF是hg38版本的,所以代码虽然是对的,但实际上做出的结果是不对的,需要把下载的TCGA的maf文件进行坐标转换。

而所谓批量,无非就是在上面的R脚本里面加入一个循环咯。

注意事项,下载的MAF文件可能有两种格式 ,可能是47列,或者120列,第一行一般都是 头文件,注释着每一列的信息,的确,信息量有点略大。如下:

     1  Hugo_Symbol
     2  Entrez_Gene_Id
     3  Center
     4  NCBI_Build
     5  Chromosome
     6  Start_Position
     7  End_Position
     8  Strand
     9  Consequence
    10  Variant_Classification
    11  Variant_Type
    12  Reference_Allele
    13  Tumor_Seq_Allele1
    14  Tumor_Seq_Allele2
    15  dbSNP_RS
    16  dbSNP_Val_Status
    17  Tumor_Sample_Barcode
    18  Matched_Norm_Sample_Barcode
    19  Match_Norm_Seq_Allele1
    20  Match_Norm_Seq_Allele2
    21  Tumor_Validation_Allele1
    22  Tumor_Validation_Allele2
    23  Match_Norm_Validation_Allele1
    24  Match_Norm_Validation_Allele2
    25  Verification_Status
    26  Validation_Status
    27  Mutation_Status
    28  Sequencing_Phase
    29  Sequence_Source
    30  Validation_Method
    31  Score
    32  BAM_File
    33  Sequencer
    34  t_ref_count
    35  t_alt_count
    36  n_ref_count
    37  n_alt_count
    38  HGVSc
    39  HGVSp
    40  HGVSp_Short
    41  Transcript_ID
    42  RefSeq
    43  Protein_position
    44  Codons
    45  Hotspot
    46  cDNA_change
    47  Amino_Acid_Change
     1  Hugo_Symbol
     2  Entrez_Gene_Id
     3  Center
     4  NCBI_Build
     5  Chromosome
     6  Start_Position
     7  End_Position
     8  Strand
     9  Variant_Classification
    10  Variant_Type
    11  Reference_Allele
    12  Tumor_Seq_Allele1
    13  Tumor_Seq_Allele2
    14  dbSNP_RS
    15  dbSNP_Val_Status
    16  Tumor_Sample_Barcode
    17  Matched_Norm_Sample_Barcode
    18  Match_Norm_Seq_Allele1
    19  Match_Norm_Seq_Allele2
    20  Tumor_Validation_Allele1
    21  Tumor_Validation_Allele2
    22  Match_Norm_Validation_Allele1
    23  Match_Norm_Validation_Allele2
    24  Verification_Status
    25  Validation_Status
    26  Mutation_Status
    27  Sequencing_Phase
    28  Sequence_Source
    29  Validation_Method
    30  Score
    31  BAM_File
    32  Sequencer
    33  Tumor_Sample_UUID
    34  Matched_Norm_Sample_UUID
    35  HGVSc
    36  HGVSp
    37  HGVSp_Short
    38  Transcript_ID
    39  Exon_Number
    40  t_depth
    41  t_ref_count
    42  t_alt_count
    43  n_depth
    44  n_ref_count
    45  n_alt_count
    46  all_effects
    47  Allele
    48  Gene
    49  Feature
    50  Feature_type
    51  One_Consequence
    52  Consequence
    53  cDNA_position
    54  CDS_position
    55  Protein_position
    56  Amino_acids
    57  Codons
    58  Existing_variation
    59  ALLELE_NUM
    60  DISTANCE
    61  TRANSCRIPT_STRAND
    62  SYMBOL
    63  SYMBOL_SOURCE
    64  HGNC_ID
    65  BIOTYPE
    66  CANONICAL
    67  CCDS
    68  ENSP
    69  SWISSPROT
    70  TREMBL
    71  UNIPARC
    72  RefSeq
    73  SIFT
    74  PolyPhen
    75  EXON
    76  INTRON
    77  DOMAINS
    78  GMAF
    79  AFR_MAF
    80  AMR_MAF
    81  ASN_MAF
    82  EAS_MAF
    83  EUR_MAF
    84  SAS_MAF
    85  AA_MAF
    86  EA_MAF
    87  CLIN_SIG
    88  SOMATIC
    89  PUBMED
    90  MOTIF_NAME
    91  MOTIF_POS
    92  HIGH_INF_POS
    93  MOTIF_SCORE_CHANGE
    94  IMPACT
    95  PICK
    96  VARIANT_CLASS
    97  TSL
    98  HGVS_OFFSET
    99  PHENO
   100  MINIMISED
   101  ExAC_AF
   102  ExAC_AF_Adj
   103  ExAC_AF_AFR
   104  ExAC_AF_AMR
   105  ExAC_AF_EAS
   106  ExAC_AF_FIN
   107  ExAC_AF_NFE
   108  ExAC_AF_OTH
   109  ExAC_AF_SAS
   110  GENE_PHENO
   111  FILTER
   112  CONTEXT
   113  src_vcf_id
   114  tumor_bam_uuid
   115  normal_bam_uuid
   116  case_id
   117  GDC_FILTER
   118  COSMIC
   119  MC3_Overlap
   120  GDC_Validation_Status

Comments are closed.