我们用188 non-small cell lung tumors数据来测试了一个R语言程序,find driver genes in cancer ~
软件地址如下:http://linus.nci.nih.gov/Data/YounA/software.zip
这是一个R语言程序,里面有readme,用法很简单。
准备好两个文件,分别是silent_mutation_table.txt and nonsilent_mutation_table.txt ,它们都是普通文本格式数据,内容如下,就是把找到的snp格式化,根据注释结果分成silent和nonsilent即可。
#Ensembl_gene_id Chromosome Start_position Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
#ENSG00000122477 1 100390656 SNP G G A TCGA-23-1022-01A-02W-0488-09
然后直接运行程序包里面的主程序,在R语言里面source("main_R_script.r")
We reanalyzed sequence data for 623 candidate genes in 188 non-small cell lung tumors using the new method.
to identify genes that are frequently mutated and thereby are expected to have primary roles in thedevelopment of tumor
To find these driver genes, each gene is tested for whether its mutation rate is significantly higher than the background (or passenger) mutation rate.
Some investigators (Sjoblom et al., 2006) further divide mutations into several types according to the nucleotide and the neighboring nucleotides of the mutations.
Ding et al. (2008)的方法的三个缺点:
1、different types of mutations can have different impact on proteins.(越影响蛋白功能的突变,越有可能是driver mutation)
2、different samples have different background mutation rates. (在高突变背景的样本中的突变,很可能是高突变背景的原因,而不是因为癌症)
3、a different number of non-silent mutations can occur at each base pair according to the genetic code.(比如Tryptophan仅仅只有一个密码子,而arginine高达6个密码子)
我们提出的方法的4个优点是:
1,我们对非同义突变根据它们对蛋白功能的影响进行了评级打分。
2,我们允许不同的样品有着不同的BMR
3,that whether the mutation is non-silent or silent depends on the genetic code
4,we take into account uncertainties in the background mutation rate by using empirical Bayes methods
还有5个需要改进的地方:
1,However, the functional impact is also dependent on the position in which a mutation occurs.(我们仅仅考虑了突变对氨基酸的改变)
2,the current scoring system which assigns mutation scores in the order: missense mutation<inframe indel<mutation in splice sites<frameshift indel=nonsense mutation may be biased toward identifying tumor suppressor genes over oncogenes.
3,we may refine our background mutation model in Table 1 so that all six types of mutations, A:T→G:C, A:T→C:G, A: T→T :A,G:C→A:T, G:C→T :A, G:C→C:G have separate mutation rates.
4,we did not take into account correlations among mutations in identifying driver genes.
5,one might combine both copy number variation and sequencing data to identify driver genes.
HGNC定义的gene Symbol转为ensemble数据库的ID,的R语言代码:
library(biomaRt)
ensembl=useMart("ensembl",dataset = "hsapiens_gene_ensembl")
all.gene.table = read.table("all_gene.symbol", header=F)
convert=getBM(attributes = c("chromosome_name","ensembl_gene_id","hgnc_symbol"),filters =c("hgnc_symbol"),values=all.gene.table[,1],mart=ensembl)
chromosome=c(1:22,"X","Y","M")
convert=convert[!is.na(match(convert[,1],chromosome)),2:3] #remove names whose matching chromosome is not 1-22, X, or Y.
convert=convert[rowSums(convert=="")==0,]
write.table(convert,"ensembl2symbol.list",quote = F,row.names =F,col.names =F)
write.table(convert,"all_gene_name.txt",quote = F,row.names =F,col.names =F)
一个gene Symbol可能对应着多个ensemble ID号,但是在每个染色体上面是一对一的关系。
有些gene Symbol可能找不到ensemble ID号,一般情况是因为这个gene Symbol并不是纯粹的HGNC定义的,或者是比较陈旧的ID。
比如下面的TIGAR ,就很可能被写作是C12orf5
Aliases for TIGAR Gene
TP53 Induced Glycolysis Regulatory Phosphatase 2 3
TP53-Induced Glycolysis And Apoptosis Regulator 2 3 4
C12orf5 3 4 6
Probable Fructose-2,6-Bisphosphatase TIGAR 3
Fructose-2,6-Bisphosphate 2-Phosphatase 3
Chromosome 12 Open Reading Frame 5 2
Fructose-2,6-Bisphosphatase TIGAR 3
Transactivated By NS3TP2 Protein 3
EC 3.1.3.46 4
FR2BP 3
External Ids for TIGAR Gene
HGNC: 1185 Entrez Gene: 57103 Ensembl: ENSG00000078237 OMIM: 610775 UniProtKB: Q9NQ88
Previous HGNC Symbols for TIGAR Gene
C12orf5
Export aliases for TIGAR gene to outside databases