下载TCGA所有癌症的maf文件做signature分析
首先TCGA所有癌症的maf文件
maf格式的mutation记录文件在TCGA里面已经是level4的数据啦,所以是完全open的,可以随意下载,只需要去其GDC官网简单点击,选择即可。
主要步骤就是在https://portal.gdc.cancer.gov/repository里面点击过滤文件类型,选择maf格式,再过滤access权限,选择open即可,最后得到的132个文件就是我们需要的。
总共是2.19GB的文件,每个癌症种类都有4种maf文件,分别是用mutect,muse,vanscan,somaticsniper这4款软件call 到的somatic mutation文件。
下载方式这里我选择下载它们132个文件的manifest文件,然后用GDC提供的官方工具来下载!关于这个工具,我 在生信技能树论坛写过教程,就不多说了,自己去看哈,现在下载TCGA数据也是非常方便,首先是GDC网站及客户端 就是安装成功后,运行 ./gdc-client download -m manifest_xxx.txt
j即可。这个manifest文件就是自己刚才创造并且下载的。
cd ~/institute/TCGA/GDC_NCBI/all ~/biosoft/GDC/gdc-client download -m gdc_manifest.2017-08-25T02-57-11.281090.txt
但是这个工具,提供的电脑操作系统版本有限哦
If you are a user of CentOS 6 or RedHat Enterprise Release 6 and wish to use the Data Transfer Tool, contact the GDC Help Desk for assistance.
所以我是在MAC里面下载好了,再上传到我的服务器去的!
然后根据MAF文件制作signature
我是根据这篇文章Mutational Profile of Metastatic Breast Cancers: A Retrospective Analysis里面的方法来做的,他们的方法描述如下:
De novo mutational signature analysis was done using the Matlab Welcome Trust Sanger Institute’s signature framework. We used the deconstructSigs R package to determine the contribution of the known signatures that explain each sample mutational profile with more than 50 somatic mutations. We considered the 13 signatures (Signatures 1, 2, 3, 5, 6, 8, 10, 13, 17, 18, 20, 26, and 30) operative in breast cancer as defined in COSMIC (http://cancer.sanger.ac.uk/signatures/matrix.png). A signature was defined as operative or predominant if its contribution to the mutational pattern was respectively >25% (or >100 mutations) or >50%.
虽然我以前写过一个类似的教程,用SomaticSignatures包来解析maf突变数据获得mutation signature 这里还是再学习学习这个新的工具deconstructSigs R package吧。
这是个R包,所以直接在Rstudio里面安装即可,这里选取BRCA的somatic mutation的MAF文件做一下分析,看看四个软件找出的变异,是否在signature上面有差异。
install.packages('deconstructSigs')
# dependencies 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19'
BiocInstaller::biocLite('BSgenome')
BiocInstaller::biocLite('BSgenome.Hsapiens.UCSC.hg19')
## https://github.com/raerose01/deconstructSigs
file1='TCGA.STAD.muse..somatic.maf.gz'
TCGA.STAD.muse=read.table(file1,sep = '\t',quote="",header = T)
TCGA.STAD.muse[1:5,1:15]
## data frame including 5 columns: sample.ID,chr,pos,ref,alt
sample.mut.ref <- data.fram(Sample='TCGA.STAD.muse',
chr = TCGA.STAD.muse[,5],
pos = TCGA.STAD.muse[,6],
ref = TCGA.STAD.muse[,11],
alt = TCGA.STAD.muse[,13])
sigs.input <- mut.to.sigs.input(mut.ref = sample.mut.ref,
sample.id = "Sample",
chr = "chr",
pos = "pos",
ref = "ref",
alt = "alt")
class(sigs.input)
sample_1 = whichSignatures(tumor.ref = sigs.input,
signatures.ref = signatures.nature2013,
sample.id = 'TCGA.STAD.muse',
contexts.needed = TRUE,
tri.counts.method = 'exome')
# Plot example
plot_example <- whichSignatures(tumor.ref = sigs.input ,
signatures.ref = signatures.nature2013,
sample.id = 'TCGA.STAD.muse' )
# Plot output
plotSignatures(plot_example, sub = 'example')
这个里面有一个问题,就是deconstructSigs R package似乎只支持hg19版本的基因组,而我下载的TCGA的MAF是hg38版本的,所以代码虽然是对的,但实际上做出的结果是不对的,需要把下载的TCGA的maf文件进行坐标转换。
而所谓批量,无非就是在上面的R脚本里面加入一个循环咯。
注意事项,下载的MAF文件可能有两种格式 ,可能是47列,或者120列,第一行一般都是 头文件,注释着每一列的信息,的确,信息量有点略大。如下:
1 Hugo_Symbol 2 Entrez_Gene_Id 3 Center 4 NCBI_Build 5 Chromosome 6 Start_Position 7 End_Position 8 Strand 9 Consequence 10 Variant_Classification 11 Variant_Type 12 Reference_Allele 13 Tumor_Seq_Allele1 14 Tumor_Seq_Allele2 15 dbSNP_RS 16 dbSNP_Val_Status 17 Tumor_Sample_Barcode 18 Matched_Norm_Sample_Barcode 19 Match_Norm_Seq_Allele1 20 Match_Norm_Seq_Allele2 21 Tumor_Validation_Allele1 22 Tumor_Validation_Allele2 23 Match_Norm_Validation_Allele1 24 Match_Norm_Validation_Allele2 25 Verification_Status 26 Validation_Status 27 Mutation_Status 28 Sequencing_Phase 29 Sequence_Source 30 Validation_Method 31 Score 32 BAM_File 33 Sequencer 34 t_ref_count 35 t_alt_count 36 n_ref_count 37 n_alt_count 38 HGVSc 39 HGVSp 40 HGVSp_Short 41 Transcript_ID 42 RefSeq 43 Protein_position 44 Codons 45 Hotspot 46 cDNA_change 47 Amino_Acid_Change
1 Hugo_Symbol 2 Entrez_Gene_Id 3 Center 4 NCBI_Build 5 Chromosome 6 Start_Position 7 End_Position 8 Strand 9 Variant_Classification 10 Variant_Type 11 Reference_Allele 12 Tumor_Seq_Allele1 13 Tumor_Seq_Allele2 14 dbSNP_RS 15 dbSNP_Val_Status 16 Tumor_Sample_Barcode 17 Matched_Norm_Sample_Barcode 18 Match_Norm_Seq_Allele1 19 Match_Norm_Seq_Allele2 20 Tumor_Validation_Allele1 21 Tumor_Validation_Allele2 22 Match_Norm_Validation_Allele1 23 Match_Norm_Validation_Allele2 24 Verification_Status 25 Validation_Status 26 Mutation_Status 27 Sequencing_Phase 28 Sequence_Source 29 Validation_Method 30 Score 31 BAM_File 32 Sequencer 33 Tumor_Sample_UUID 34 Matched_Norm_Sample_UUID 35 HGVSc 36 HGVSp 37 HGVSp_Short 38 Transcript_ID 39 Exon_Number 40 t_depth 41 t_ref_count 42 t_alt_count 43 n_depth 44 n_ref_count 45 n_alt_count 46 all_effects 47 Allele 48 Gene 49 Feature 50 Feature_type 51 One_Consequence 52 Consequence 53 cDNA_position 54 CDS_position 55 Protein_position 56 Amino_acids 57 Codons 58 Existing_variation 59 ALLELE_NUM 60 DISTANCE 61 TRANSCRIPT_STRAND 62 SYMBOL 63 SYMBOL_SOURCE 64 HGNC_ID 65 BIOTYPE 66 CANONICAL 67 CCDS 68 ENSP 69 SWISSPROT 70 TREMBL 71 UNIPARC 72 RefSeq 73 SIFT 74 PolyPhen 75 EXON 76 INTRON 77 DOMAINS 78 GMAF 79 AFR_MAF 80 AMR_MAF 81 ASN_MAF 82 EAS_MAF 83 EUR_MAF 84 SAS_MAF 85 AA_MAF 86 EA_MAF 87 CLIN_SIG 88 SOMATIC 89 PUBMED 90 MOTIF_NAME 91 MOTIF_POS 92 HIGH_INF_POS 93 MOTIF_SCORE_CHANGE 94 IMPACT 95 PICK 96 VARIANT_CLASS 97 TSL 98 HGVS_OFFSET 99 PHENO 100 MINIMISED 101 ExAC_AF 102 ExAC_AF_Adj 103 ExAC_AF_AFR 104 ExAC_AF_AMR 105 ExAC_AF_EAS 106 ExAC_AF_FIN 107 ExAC_AF_NFE 108 ExAC_AF_OTH 109 ExAC_AF_SAS 110 GENE_PHENO 111 FILTER 112 CONTEXT 113 src_vcf_id 114 tumor_bam_uuid 115 normal_bam_uuid 116 case_id 117 GDC_FILTER 118 COSMIC 119 MC3_Overlap 120 GDC_Validation_Status