五 19

用GISTIC多个segment文件来找SCNA变异

Posted on 2016年5月19日 by ulwvfje

这个软件在TCGA计划里面被频繁使用者，用这个软件的目的很简单，就是你研究了很多癌症样本，通过芯片得到了每个样本的拷贝数变化信息，芯片结果一般是segment结果，可以解释为CNV区域，需要用GISTIC把样本综合起来分析，寻找somatic的CNV，并且注释基因信息。

有两个难点，一是在linux下面安装matlab工作环境，二是如何制作输入文件。

Continue reading →

五 18

2016-TCGA数据挖掘系列文章之癌症男女有别

Posted on 2016年5月18日 by ulwvfje

这是TCGA数据挖掘系列文章之一，是安德森癌症研究中心的Han Liang主导的，纯粹的生物信息学数据分析文章。

文章题目是：comprehensive characterization of molecular differences in cancer between male and female patients.

研究意义：癌症病人的性别对肿瘤发生，扩散的意义不言而喻。不仅仅是因为很多癌症本来就是有性别特异性，比如卵巢癌之于女性、前列腺癌之于男性。即使对于其它并非性别特异性的癌症种类，男女病人在肿瘤发生，扩散，以及治疗阶段的反应也大不一样。但是以前对这样分子机理研究的很有限，一般集中在某些性别相关的分子pattern，比如非小细胞肺癌女性患者的EGFR突变，但那些研究要么就局限于单一的基因，要么局限于单一的数据类型，或者研究单一的癌症。严重缺乏一个全面的，系统的分析癌症患者的性别差异。而且TCGA数据库的出现让这一个研究变成了可能，这也就是本文章的出现的原因。

数据挖掘的对象：

如表所示，涉及到13种癌症，TCGA的六种数据()都用上了，因为是2016年，所以数据量也比较全面了。

Continue reading →

五 16

TCGA数据挖掘系列文章之-pseudogene假基因探究

Posted on 2016年5月16日 by ulwvfje

这是TCGA数据挖掘系列文章之一，是安德森癌症研究中心的Han Liang主导的，纯粹的生物信息学数据分析文章。

文章见：http://www.nature.com/ncomms/2014/140707/ncomms4963/full/ncomms4963.html

TCGA数据库的数据量现在已经非常可观了，一万多的肿瘤样本数据，关于假基因的这篇文章是2014年发的，所以他们只研究了2,808个样本数据，也只涉及到7个癌症种类。

Continue reading →

五 06

所有TCGA的maf格式somatic突变数据均可下载

Posted on 2016年5月6日 by ulwvfje

如果你研究癌症，那么TCGA计划的如此丰富的公共数据你肯定不能错过，一般人只能获取到level3的数据，当然，其实一般人也没办法使用level1和level2的数据，毕竟近万个癌症样本的原始测序数据，还是很恐怖的，而且我们拿到原始数据，再重新跑pipeline，其实并不一定比人家TCGA本身分析的要好，所以我们直接拿到分析结果，就足够啦！

Continue reading →

五 06

突变频谱探究mutation siganures

Posted on 2016年5月6日 by ulwvfje

这也是对TCGA数据的深度挖掘，从而提出的一个统计学概念。文章研究了30种癌症，发现21种不同的mutation signature。如果理解了，就会发现这个其实蛮简单的，他们并不重新测序，只是拿已经有了的TCGA数据进行分析，而且居然是发表在nature上面！

研究了4,938,362 mutations from 7,042 cancers样本，突变频谱的概念只是针对于somatic 的mutation。一般是对癌症病人的肿瘤组织和癌旁组织配对测序，过滤得到的somatic mutation，一般一个样本也就几百个somatic 的mutation。

paper链接是：http://www.nature.com/nature/journal/v500/n7463/full/nature12477.html

还有其它文章也是这样分析的http://www.sciencedirect.com/science/article/pii/S0959437X13001639

从2013年提出到现在，已经有30种mutation siganures，在cosmic数据库有详细记录，更新见：http://cancer.sanger.ac.uk/cosmic/signatures

它的概念就是：根据突变上下文分成96类，然后每类突变的频率不一样画一个条形图，可视化展现。

Each signature is displayed according to the 96 substitution classification defined by the substitution class and sequence context immediately 3′ and 5′ to the mutated base. The probability bars for the six types of substitutions are displayed in different colours.

仔细看paper，还是蛮好理解的，自己写一个脚本就可以做这个分析了，前提是下载各个癌症的somatic mutation文件，一般是maf格式的，很多途径下载。

In principle, all classes of mutation (such as substitutions, indels, rearrangements) and any accessory mutation characteristic, for example, the sequence context of the mutation or the transcriptional strand on which it occurs, can be incorporated into the set of features by which a mutational signature is defined. In the first instance, we extracted mutational signatures using base substitutions and additionally included information on the sequence context of each mutation. Because there are six classes of base substitution—C>A, C>G, C>T, T>A, T>C, T>G (all substitutions are referred to by the pyrimidine of the mutated Watson–Crick base pair)—and as we incorporated information on the bases immediately 5′ and 3′ to each mutated base, there are 96 possible mutations in this classification. This 96 substitution classification is particularly useful for distinguishing mutational signatures that cause the same substitutions but in different sequence contexts.

很多癌症都发现了不止一种mutation signature，甚至高达6种，说明癌症之间差异还是蛮大的！

In most cancer classes at least two mutational signatures were observed, with a maximum of six in cancers of the liver, uterus and stomach. Although these differences may, in part, be attributable to differences in the power to extract signatures, it seems likely that some cancers have a more complex repertoire of mutational processes than others.

Most individual cancer genomes exhibit more than one mutational signature and many different combinations of signatures were observed

但是，我最后也没能绝对的界限是什么，因为总不能用肉眼来看每个突变频谱不一样吧？

The set of signatures will be updated in the future. This will include incorporating additional mutation types (e.g., indels, structural rearrangements, and localized hypermutation such as kataegis) and cancer samples. With more cancer genome sequences and the additional statistical power this will bring, new signatures may be found, the profiles of current signatures may be further refined, signatures may split into component signatures and signatures may be found in cancer types in which they are currently not detected.

分类会持续不断更新，随着更多的cancer type和样本加入，新的signature会被发现，现有的signature也可能会被重新定义，或者被分割成多个小的signature

四 10

我们进入了一个癌症研究非常好的时代？

Posted on 2016年4月10日 by ulwvfje

在推特上面看到某人推荐了一篇好文，好像是某位大佬在science上面写的，大概就是鼓吹大家进入癌症研究领域。

http://science.sciencemag.org/content/352/6282/123.full

感觉写的还不错，就翻译了一下，译文如下：

Continue reading →

一 28

使用可视化工具MutationMapper来看看基因上面突变的分布

Posted on 2016年1月28日 by ulwvfje

how to generate lollipop diagrams ?

问题来源：https://www.biostars.org/p/65347/

这个工具是网页版的，不需要下载，只需上传两列数据即可，就是基因上面的第几个氨基酸是如何突变的这样的信息

网页就可以把这些信息画在基因上面，而且注释到domain和cosmic数据库，挺好用的！

http://www.cbioportal.org/mutation_mapper.jsp

Hugo_Symbol	HUGO symbol for the gene	TP53
Protein_Change	Amino acid change	V600E

结果也很容易理解！

我输入了4个基因，网页就会对每个基因画一个图，其实ABCA1这个基因有6个突变，就都画在该基因上面了，基因突变分布在基因的哪个domain也看得一清二楚！

我的输入文件是这样的！

非常好用：

Support mutation data with annotated protein effects
Mutation diagram/lollipop view
Mutation table view
3D structure view if available

而且pfam数据库本身也提供了这个功能：

Pfam provides an online tool to not only generate the domain information in JSON format, but to draw the lollipop diagram using javascript as well. They have more information here: http://pfam.xfam.org/help#tabview=tab9

IMHO, not as pretty as cBioPortal's but it gets you close to a solution.

EDIT / SHAMELESS PLUG: After seeing the data available and how easy it'd be, I made my own quick tool to fetch the data and draw the diagram for me in a style similar to cBioPortal - feel free to fork it and add features: https://github.com/pbnjay/lollipops

Example output (w/ labels per the comments)

甚至还有高手用D3.JS写了一个能实现同样需求的模块

We found ourselves in the same need, we wanted such a plot (JavaScript). Thus, I add our solution, Mutations Needle Plot. The library creates an SVG image (with D3), which then may be downloaded.

Live examples are found at BioJS: http://registry.biojs.net/client/#/detail/muts-needle-plot
Code is available at GitHub: https://github.com/bbglab/muts-needle-plot
And it is also a npm-package: https://www.npmjs.com/package/muts-needle-plot

You will npm in order to be able to install & run the library.

Examples may be found in the snippets folder or also the index.html - The one displayed here below

类似的高手很多：

My colleague @SolenaLS recently asked me to write something like this: (uniprot+SVG+javascript : )http://lindenb.github.io/pages/uniprot/paintsvg.html

而且ensembl数据库也提供类似的功能

Ensemble also gives similar plot.

http://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000141510;r=17:7565097-7590856;t=ENST00000269305

一 22

用TCGA数据做cox生存分析的风险因子（比例风险模型）

Posted on 2016年1月22日 by ulwvfje

再次强调一下，R里面实现生存分析非常简单！

用my.surv <- surv(OS_MONTHS,OS_STATUS=='DECEASED')构建生存曲线。

用kmfit2 <- survfit(my.surv~TUMOR_STAGE_2009)来做某一个因子的KM生存曲线。用 survdiff(my.surv~type, data=dat)来看看这个因子的不同水平是否有显著差异，其中默认用是的logrank test 方法。

用coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung) 来检测自己感兴趣的因子是否受其它因子(age,gender等等)的影响。

Continue reading →

一 14

用RNA-SeQC得到表达矩阵RPKM值

Posted on 2016年1月14日 by ulwvfje

这个软件不仅仅能做QC，而且可以统计各个基因的RPKM值！尤其是TCGA计划里面的都是用它算的

一、程序安装

直接在官网下载java版本软件即可使用：http://www.broadinstitute.org/cancer/cga/tools/rnaseqc/RNA-SeQC_v1.1.8.jar

但是需要下载很多注释数据

二、输入数据

箭头所指的文件，一个都不少，只有那个rRNA.tar我没有用，因为这个软件有两种使用方式，我用的是第一种

三、软件使用

软件的官网给力例子，很容易学习：

RNA-SeQC can be run with or without a BWA-based rRNA level estimation mode. To run without (less accurate, but faster) use the command:
java -jar RNASeQC.jar -n 1000 -s "TestId|ThousandReads.bam|TestDesc" -t gencode.v7.annotation_goodContig.gtf -r Homo_sapiens_assembly19.fasta -o ./testReport/ -strat gc -gc gencode.v7.gc.txt

我用的就是这个例子，这个例子需要的所有文件里面，染色体都是没有chr的，这个非常重要！！！

代码如下：

java -jar RNA-SeQC_v1.1.8.jar \
-n 1000 \
-s "TestId|ThousandReads.bam|TestDesc" \
-t gencode.v7.annotation_goodContig.gtf \
-r ~/ref-database/human_g1k_v37/human_g1k_v37.fasta \
-o ./testReport/ \
-strat gc \
-gc gencode.v7.gc.txt \

To run the more accurate but slower, BWA-based method :
java -jar RNASeQC.jar -n 1000 -s "TestId|ThousandReads.bam|TestDesc" -t gencode.v7.annotation_goodContig.gtf -r Homo_sapiens_assembly19.fasta -o ./testReport/ -strat gc -gc gencode.v7.gc.txt -BWArRNA human_all_rRNA.fasta
Note: this assumes BWA is in your PATH. If this is not the case, use the -bwa flag to specify the path to BWA

四、结果解读

运行要点时间，就那个一千条reads的测试数据都搞了10分钟！

出来一大堆突变，具体解释，官网上面很详细，不过，比较重要的当然是RPKM值咯，还有QC的信息

TCGA数据里面都会提供由RNA-SeQC软件处理得到的表达矩阵！

Expression

RPKM data are used as produced by RNA-SeQC.
Filter on >=10 individuals with >0.1 RPKM and raw read counts greater than 6.
Quantile normalization was performed within each tissue to bring the expression profile of each sample onto the same scale.
To protect from outliers, inverse quantile normalization was performed for each gene, mapping each set of expression values to a standard normal.

软件的主页是：

http://www.broadinstitute.org/cancer/cga/rnaseqc_run

http://www.broadinstitute.org/cancer/cga/rnaseqc_download

帮助文件：http://www.broadinstitute.org/cancer/cga/sites/default/files/data/tools/rnaseqc/RNA-SeQC_Help_v1.1.2.pdf

一 11

对CCLE数据库可以做的分析

Posted on 2016年1月11日 by ulwvfje

收集了那么多的癌症细胞系的表达数据，拷贝数变异数据，突变数据，总不能放着让它发霉吧!

这些数据可以利用的地方非常多，但是在谷歌里面搜索引用了它的文章却不多，我挑了其中几个，解读了一下别人是如何利用这个数据的，当然，主要是用那个mRNA的表达数据咯！

第一篇：http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0111146

这篇文献对CCLE的数据进行了八个步骤的处理，一个合格的生物信息学分析着完全可以重写这个过程

step1:Affymetrix U133 Plus2 DNA microarray gene expressions of 27 gastric cancer cell lines (Kato-III, IM95, SNU-620, SNU-16, OCUM-1, NUGC-4, 2313287, HUG1N, MKN45, NCIN87, KE39, AGS, SNU-5, SNU-216, NUGC-3, NUGC-2, MKN74, MKN7, RERFGC1B, GCIY, KE97, Fu97, SH10TC, MKN1, SNU-1, Hs746 T, HGC27) were downloaded from Cancer Cell Line Encyclopedia (CCLE) [16] in March 2013.

step2: Robust Multi-array Average (RMA) normalization was performed. Principal component analysis plot show no obvious batch effect.

step3: The normalized data is then collapsed by taking the probe sets with highest gene expression.

前三步是为了得到27个胃癌相关细胞系的mRNA表达矩阵，方法是下载cel文件，用RMA归一化，对多探针基因去最大表达量探针！

step4:Unsupervised hierarchical clustering (1-Spearman distance, average linkage) was performed on the cell lines using the aCGH data.

Putative driver genes of which copy number aberrations correlated to mRNA gene expression were identified to determine subtypes or clusters that are driven by different mechanisms. This was done using Mann Whitney U-test with p<0.05, and Spearman Correlation Coefficient test with Rho >0.6.

step5:We then performed consensus clustering[17] on the gene expression data of the 27 gastric cancer cell lines from CCLE using these putative driver genes. We selected k = 2 as it gives sufficiently stable similarity matrix.

step6: In order to assign new samples to this integrative cluster, significance analysis of microarray (SAM) [18]with threshold q<2.0 was used to generate subtype signature based on the mRNA expression data of the 1762 genes from the 27 gastric cancer cell lines in CCLE.

先用甲基化数据来聚类，得到putative driver genes，然后再用这些基因的表达数据来再次聚类，分成两类，然后对这两类进行SAM找差异基因

step7:ssGSEA (single sample GSEA)was used to estimate pathway activities of the gastric cancer cell line in the Molecular Signature Database v3.1 (Msigdb v3.1) [19], [20]. The pathway activities are represented in enrichment scores which were rank normalized to [0.0, 1.0].

step8:SAM analysis was performed with threshold q<0.2, and fold change >2.0 (for up-regulated pathways), or <0.5 (for down-regulated pathways) to obtain subtype-specific pathways from the 27 gastric cell lines in CCLE.

这里既用来gene set的富集分析，又用来超几何分布的富集分析，结果去看看这篇文章就知道了！

第二篇文献：http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081803#pone.0081803.s001

这篇文章只用了CCLE的一个地方，就是看看不同cancer type里面的某个基因表达boxplot

这个图的数据用GEOquery可以得到，样本的分类信息也用GEOquery可以得到，这样就可以做下面这个图了，非常简单

Further, the Cancer Cell Line Encyclopedia (CCLE) database demonstrated that of 1062 cell lines representing 37 distinct cancer types, glioma cell lines express the highest levels of STK17A

结论就是：STK17A is highly expressed in glioma cell lines compared to other cancer types. Data was obtained through the Cancer Cell Line Encyclopedia (CCLE).

第三篇文献：http://www.nature.com/ncomms/2013/130709/ncomms3126/fig_tab/ncomms3126_F4.html

这篇文献更简单了，直接对这个表达矩阵进行聚类：

Evaluating cell lines as tumour models by comparison of genomic profiles

The 5,000 most variable genes were used for unsupervised clustering of cell lines by mRNA expression data. Cell lines are colour-coded (vertical bars) according to the reported tissue of origin (a PDF version that can be enlarged at high resolution is in Supplementary Information, Supplementary Fig. S4); horizontal labels at bottom indicate the dominating tissue types within the respective branches of the dendrogram. Most ovarian cancer cell lines (magenta) cluster together, interspersed with endometrial cell lines. However, some ovarian cancer cell lines cluster with other tissue types (*). Top right panels: neighbourhoods (1) of the top cell lines in our analysis, (2) of cell line IGROV1, and (3) of cell line A2780. For the ovarian cancer cell lines in these enlarged areas, the histological subtype as assigned in the original publication is indicated by coloured letters.

就直接拿整个表达矩阵即可，然后挑选变异最大的5000个基因来进行聚类，就可以得到类似的图

一 11

CCLE数据库几个知识点

Posted on 2016年1月11日 by ulwvfje

发表ccle的文献：http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3320027/

Here we describe the Cancer Cell Line Encyclopedia (CCLE): a compilation of gene expression, chromosomal copy number, and massively parallel sequencing data from 947 human cancer cell lines.

收集了三种数据：

The mutational status of >1,600 genes was determined by targeted massively parallel sequencing, followed by removal of variants likely to be germline events .

Moreover, 392 recurrent mutations affecting 33 known cancer genes were assessed by mass spectrometric genotyping13 .

DNA copy number was measured using high-density single nucleotide polymorphism arrays (Affymetrix SNP 6.0; Supplementary Methods).

Finally, mRNA expression levels were obtained for each of the lines using Affymetrix U133 plus 2.0 arrays.

These data were also used to confirm cell line identities .

一般用得最多的就是表达数据，因为表达数据最简单，大多数生物信息学分析着只会用这个数据！

而它的突变数据又不是通常意义的高通量测序得到的，snp6芯片数据很多人听都没听过

文章的附件有对cell lines的具体描述。

CCLE的数据在broad institute里面可以下载，也放在GEO数据库里面，我比较喜欢GEO里面的数据

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36139

This SuperSeries is composed of the following SubSeries:

GSE36133 Expression data from the Cancer Cell Line Encyclopedia (CCLE)

GSE36138 SNP array data from the Cancer Cell Line Encyclopedia (CCLE)

GSE36133这个study的metadata里面有对每个cellline来源的cancer进行描述！

有人喜欢把这个metadata叫做是clinical data。

library(GEOquery)

ccleFromGEO <- getGEO("GSE36133")

annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))

>dim(annotBlock1)

[1] 917 38

exprSet=exprs(ccleFromGEO[[1]])

> dim(exprSet)

[1] 18926 917

##它的表达数据矩阵，包含了18926个基因，列名是917个细胞系的名字，行是基因的entrez ID

keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1",

"characteristics_ch1.2")

options(stringsAsFactors = F)

allAnnot=annotBlock1[,keyColumns]

##这几列信息是比较重要的metadata，里面详细记录了细胞系的收集公司单位，tissue，癌症分类等信息

Cell line （1035个细胞系简介）Gene Sets

1035 sets of genes with high or low expression in each cell line relative to other cell lines from the CCLE Cell Line Gene Expression Profiles dataset.

http://amp.pharm.mssm.edu/Harmonizome/dataset/CCLE+Cell+Line+Gene+Expression+Profiles

一些关于CCLE数据库的文章：

http://cancerres.aacrjournals.org/content/73/8_Supplement/2409.short

http://cancerres.aacrjournals.org/content/74/22/6390.short

https://clincancerres.aacrjournals.org/content/19/19_Supplement/IA2.abstract

http://onlinelibrary.wiley.com/doi/10.1002/cncy.21471/pdf 介绍了几个类似的数据库资源

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0088557 讲解了high/low的知识

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7060697 药物相关

Anticancer drug sensitivity analysis: An integrated approach applied to Erlotinib sensitivity prediction in the CCLE database

http://biorxiv.org/content/biorxiv/early/2015/10/02/028159.full.pdf 比较了CCLE和TCGA的数据

一 08

TCGA数据里面的生存分析例子

Posted on 2016年1月8日 by ulwvfje

我们知道了生存分析，就是随着时间的流逝，死亡率是如何增加的，一般是用KM法来估计生存函数，然后画个图即可！而根据某些因子把样本分组，可以看到他们死亡率的变化趋势显著的不同，这就说明了我们的这个因子是非常有效的分类方式，这个因子可以是一个biomarker，也可以某些其它指标！

甚至，我们还可以用cox模型来分析这个因子是如何影响生存函数的，那个稍后再讲

这里，我们就简单讲一个例子，是TCGA里面卵巢癌的数据，根据甲基化数据分成了4个组，那么我们就下载这四个组样本的临床数据，

看看这样分组后，他们的死亡率变化趋势是不是有显著区别！

Continue reading →

一 08

生存分析简介

Posted on 2016年1月8日 by ulwvfje

一般我们谈生存分析，就是说的KM方法估计生存函数，并且画出生存曲线，然后还可以根据分组检验一下它们的生存曲线是否有显著的差异！

在R里面，非常的方便，有个包survival很容易就可以做生存分析了！

只需要记住三个函数即可：

Surv：用于创建生存数据对象
survfit：创建KM生存曲线或是Cox调整生存曲线
survdiff：用于不同组的统计检验

Continue reading →

一 05

寻找somatic突变的软件大合集

Posted on 2016年1月5日 by ulwvfje

其实somatic突变很容易理解，你测同一个人的正常组织和癌症组织，然后比较对这两个样本数据call出来的snp位点

只存在癌症组织数据里面的snp位点就是somatic突变，在两个样本都存在的snp位点就是germline的突变，不过一般大家研究的都是somatic突变。

当然，理论上是很简单，但是那么多统计学家要吃饭呀，肯定得把这件事搞复杂才行，所以就有了非常多的somatic突变 calling的软件，开个玩笑哈，主要是因为我们的测序并不是对单个细胞测序，我们通常意义取到的正常组织和癌症组织都不是纯的，所以会有很多关于这一点的讨论。

正好我看到了一篇帖子，收集了大部分比较出名的做somatic mutation calling的软件，当然，我只用过mutect和varscan。

来自于：https://www.biostars.org/p/19104/

Here are a few more, a summary of the other answers, and updated links:

deepSNV (abstract) (paper)
EBCall (abstract) (paper)
GATK SomaticIndelDetector (note: only available after an annoying sign-up and login)
Isaac variant caller (abstract) (paper)
joint-snv-mix (abstract) (paper)
LoFreq (abstract) (paper) (call on tumor & normal separately and then use a filter to derive somatic events)
MutationSeq (abstract) (paper)
MutTect (abstract) (paper) (note: only available after an annoying sign-up and login)
QuadGT (for calling single-nucleotide variants in four sequenced genomes comprising a normal-tumor pair and the two parents)
samtools mpileup - by piping BCF format output from this to bcftools view and using the '-T pair' option
Seurat (abstract) (paper)
Shimmer (abstract) (paper)
SolSNP (call on tumor & normal separately and then compare to identify somatic events)
SNVMix (abstract) (paper)
SOAPsnv
SomaticCall (manual)
SomaticSniper (abstract) (paper)
Strelka (abstract) (paper)
VarScan2 (abstract) (paper)
Virmid (abstract) (paper)

For a much more general discussion of variant calling (not necessarily somatic or limited to SNVs/InDels) check out this thread: What Methods Do You Use For In/Del/Snp Calling?

Some papers describing comparisons of these callers:

The ICGC-TCGA DREAM Mutation Calling challenge has a component on somatic SNV calling.

This paper used validation data to compare popular somatic SNV callers:

Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers

You'll need to update the link to MuTect. Broad Institute has begun to put portable versions of their tools on Github, like thelatest release of MuTect. The Genome Institute at WashU has been using Github for a while, but portable versions of their tools can be found here and here.

其实somatic的calling远比我们想象的要复杂：

To rehash/expand on what Dan said, if you're sequencing normal tissue, you generally expect to see single-nucleotide variant sites fall into one of three bins: 0%, 50%, or 100%, depending on whether they're heterozygous or homozygous.

With tumors, you have to deal with a whole host of other factors:

Normal admixture in the tumor sample: lowers variant allele fraction (VAF)
Tumor admixture in the normal - this occurs when adjacent normals are used, or in hematological cancers, when there is some blood in the skin normal sample
Subclonal variants, which may occur in any fraction of the cells, meaning that your het-site VAF might be anywhere from 50% down to sub-1%, depending on the tumor's clonal architecture and the sensitivity of your method
Copy number variants, cn-neutral loss of heterozygosity, or ploidy changes, all of which again shift the expected distribution of variant fractions

These, and other factors, make calling somatic variants difficult and still an area that is being heavily researched. If someone tells you that somatic variant calling is a solved problem, they probably have never tried to call somatic variants.

Sounds like somatic / tumor variant calling is something that will be solved by improvements at the wet lab side ( single cell selection / amplification / sequencing ) . Rather than at the computational side.

Well, single cell has a role to play (and would have more of one if WGA wasn't so lossy), but realistically, you can't sequence billions of cells from a tumor individually. Bulk sequencing still is going to have a role for quite a while.

Hell germ line calling isn't even a solved problem. Still get lots of false positives (and false negatives). It just tends to work so well that it is hard to improve it much except by making it faster, less memory intensive, etc

Solved was the wrong word. I just meant improved. There is only so much you can do at the computational side. Wet lab also has its part to play.

A germline variant caller generally has a ploidy-based genotyping algorithm built in to part of the algorithm/pipeline. I believe, IIRC, the GATK UnifiedGenotyper for instance does both variant calling and then genotype calling. So to call a genotype for a variant it is expecting a certain number of reads to support the alternative allele. When working with somatic variants all of the assumptions about how many reads you expect with a variant at a position to distinguish between true and false positives are no longer valid. Except for fixed mutations throughout the tumor population only some proportion of cells will hold a somatic variation. You also typically have some contamination from normal non-cancerous cells. Add in complications from significant genomic instability with lots of copy number variations and such and you have a need for a major change in your model for calling variation while minimizing artifactual calls. So you have a host of other programs that have been developed specifically for looking at somatic variation in tumor samples.

一篇文献：

Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

是qiagen公司发的

High-throughput sequencing is rapidly becoming common practice in clinical diagnosis and cancer research. Many algorithms have been developed for somatic single nucleotide variant (SNV) detection in matched tumor-normal DNA sequencing. Although numerous studies have compared the performance of various algorithms on exome data, there has not yet been a systematic evaluation using PCR-enriched amplicon data with a range of variant allele fractions. The recently developed gold standard variant set for the reference individual NA12878 by the NIST-led “Genome in a Bottle” Consortium (NIST-GIAB) provides a good resource to evaluate admixtures with various SNV fractions.

Using the NIST-GIAB gold standard, we compared the performance of five popular somatic SNV calling algorithms (GATK UnifiedGenotyper followed by simple subtraction, MuTect, Strelka, SomaticSniper and VarScan2) for matched tumor-normal amplicon and exome sequencing data.

Nevertheless, detecting somatic mutations is still challenging, especially for low-allelic-fraction variants caused by tumor heterogeneity, copy number alteration, and sample degradation

We used QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (CCP, Version 1) for enrichment and library construction in triplicate。

QIAGEN’s GeneRead DNAseq Comprehensive Cancer Gene Panel (Version 1) was used to amplify the target region of interest (124 genes, 800 Kb).

When analyzing different types of data, use of different algorithms may be appropriate.

DNA samples of NA12878 and NA19129 were purchased from Coriell Institute. Sample mixtures were created based on the actual amplifiable DNA in each sample, resulting in 0%, 8%, 16%, 36%, and 100% of NA12878 sample mixed in the NA19129 sample, respectively.We treated the mixed samples at 8%, 16%, 36%, and 100% as the virtual tumor samples and the 0% as the virtual normal sample.

五个软件的算法是：

1. NaiveSubtract — SNVs were called separately from virtual tumor and normal samples using GATK UnifiedGenotyper [22]. For exome sequencing data, reads were already mapped, locally realigned and recalibrated by the 1,000 Genomes Project. So SNVs were directly called on the BAM files using GATK Unified Genotyper. Then, SNVs detected in the virtual normal sample were removed from the list of SNVs detected in the virtual tumor sample, leaving the “somatic” SNVs.

2. MuTect — MuTect is a method developed for detecting the most likely somatic point mutations in NGS data using a Bayesian classifier approach. The method includes pre-processing aligned reads separately in tumor and normal samples and post-processing resulting variants by applying an additional set of filters. We ran MuTect under the High-Confidence mode with its default parameter settings. We disabled the “Clustered position” filter and the “dbSNP filter” for the amplicon sequencing reads, and we disabled the “dbSNP filter” for the exome sequencing.

3. SomaticSniper — SomaticSniper calculates the Bayesian posterior probability of each possible joint genotype across the normal and cancer samples. We tuned the software’s parameters to increase sensitivity and then filtered raw results using a Somatic Score cut-off of 20 to improve specificity.

4. Strelka — Strelka reports the most likely genotype for tumor and normal samples based on a Bayesian probability model. Post-calling filters built into the software are based on factors such as read depth, mismatches, and overlap with indels. We skipped depth filtration for exome and amplicon sequencing data as recommended by the Strelka authors. For the amplicon sequencing reads, we set the minimum MAPQ score at 17 for consistency with the defaults in GATK UnifiedGenotyper. We used variants passing Strelka post-calling filters for analysis.

5. VarScan2 — VarScan2 performs analyses independently on pileup files from the tumor and normal samples to heuristically call a genotype at positions achieving certain thresholds of coverage and quality. Then, sites of the genotypes not matched in tumor and normal samples are classified into somatic, germline, or ambiguous groups using Fisher’s exact test. We generated the pileup files using SAMtools mpileup command.

The compatibility of the output VCF files between different methods as well as the NIST-GIAB gold standard was examined using bcbio.variation tools and manual inspection. The reported SNP call representations between files are comparable to each other.

来自于文献：http://www.biomedcentral.com/1471-2164/15/244

一 05

使用oncotator做突变注释

Posted on 2016年1月5日 by ulwvfje

功能：vcf格式突变数据进一步注释成maf格式

做过癌症数据分析的童鞋都知道，TCGA里面用maf格式来记录突变！那么maf格式的数据是如何得来的呢，我们都知道，做完snp-calling一般是得到vcf格式的突变记录数据文件，然后再用annovar或者其它蛋白结构功能影响预测软件注释一下，还远达不到maf的近100条记录。

而大名鼎鼎的broad institute就规定了maf格式的突变注释文件，他就是利用了十几个常见的已知数据库来注释我们得到的vcf突变记录，通常是对somatic的突变才注释成maf格式的数据！

大名鼎鼎的broadinstitute出品的突变注释工具：http://www.ncbi.nlm.nih.gov/pubmed/25703262

源码在github： https://github.com/broadinstitute/oncotator

软件官网是： https://www.broadinstitute.org/oncotator/

说明书: http://gatkforums.broadinstitute.org/gatk/discussion/4154/howto-install-and-run-oncotator-for-the-first-time

需要提前自己下载14G的数据：http://www.broadinstitute.org/~lichtens/oncobeta/oncotator_v1_ds_Jan262015.tar.gz

软件可以在官网下载：https://github.com/broadinstitute/oncotator/archive/v1.8.0.0.tar.gz

本身也是一个在线工具：

input data数据指南：https://www.broadinstitute.org/oncotator/help/#inputformat

集成了下面所有的分析资源

而且还提供了API

Genomic Annotations

Gene, transcript, and functional consequence annotations using GENCODE for hg19.
Reference sequence around a variant.
GC content around a variant.
Human DNA Repair Gene annotations from Wood et al.

Protein Annotations

Site-specific protein annotations from UniProt.
Functional impact predictions from dbNSFP.

Cancer Variant Annotations

Observed cancer mutation frequency annotations from COSMIC.
Cancer gene and mutation annotations from the Cancer GenCensus.
Overlapping mutations from the Cancer Cell Line Encyclopedia.
Cancer gene annotations from the Familial Cancer Database.
Cancer variant annotations from ClinVar.

Non-Cancer Variant Annotations

Common SNP annotations from dbSNP.
Variant annotations from 1000 Genomes.
Variant annotations from NHLBI GO Exome Sequencing Project (ESP).

因为要下载的数据有点多，我这里就不用自己的电脑测试了，安装过程也很简单的！

十二 29

用firehose_get 来下载所有TCGA寄存在broad的数据

Posted on 2015年12月29日 by ulwvfje

该软件是broad institute写的一个数据接口，主要是供他人下载TCGA的所有寄存在broad institute的免费数据，主要是level3，level4的数据。(说错了，好像只有level4的数据，就是可以发文章的分析结果及图片)

软件下载地址：https://confluence.broadinstitute.org/display/GDAC/Download

懂它的使用规则，编码规则即可：

https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm

就是一个很简单的shell脚本而已，根据几个用户自定义参数来选择性的下载数据。

我们可以用-t这个参数来指定下载的数据类型，可以是mut/rna/mutsig/gistic等各种数据，至于这些单词代表什么意义，需要自己去看说明书啦

还可以指定时间，截止到什么时间的数据！

它支持的癌症种类：

ACC  BLCA  BRCA  CESC  COAD  COADREAD  DLBC  ESCA  
	GBM  HNSC  KICH  KIRC  KIRP  LAML  LGG  LIHC  
	LUAD  LUSC  OV  PAAD  PANCANCER  PANCAN8  PANCAN12  PRAD  
	READ  SARC  SKCM  STAD  THCA  UCEC  UCS

这些癌症种类的简称，也是可以去官网里面看到的！官网：http://gdac.broadinstitute.org

十二 25

做癌症研究一定要把这几十篇TCGA的大文章看完

Posted on 2015年12月25日 by ulwvfje

都是发表在nature,cell还有新英格兰医学杂志上面的超级文章！每个文章附件都有一百多页，比博士论文还长，但是它们的分析套路其实都一样，都是那几种数据，包括WGS，WES，RNA-Seq，芯片表达量，miRNA表达量，甲基化数据，蛋白数据。分析过程也差不多，无法就是对癌症进行进一步的分类，癌症亚型，或者看看driver mutation，进一步解释癌症病变，转移，扩散机理，或者找标记物signature，辅助治疗等等，具体的要等我把这些文献看完了才能再进一步讲解，请做癌症研究方向的一定要把它们看完。

我已经下载完了，大家如果没有权限下载，就需要自己想办法啦！

非常值得大家阅读！！！

十一 05

使用mutsig软件来找驱动基因

Posted on 2015年11月5日 by ulwvfje

从数以万计的突变里面找到driver mutation这个课题很大，里面的软件我接触的就有十几个了，但是我尝试了其中几个，总是无法运行成功，不知道为什么，终于今天成功了一个，就是mutsig软件！其实关于突变数据找driver mutation ，台湾一个大学做了一个数据库DriverDB http://ngs.ym.edu.tw/driverdb/：还因此发了一篇文章：http://nar.oxfordjournals.org/content/early/2013/11/07/nar.gkt1025.full.pdf，挺不错的！

关于driver mutation的理论最近也进化了很多，算是比较完善了吧，但是我一直没时间静下心来好好补充理论知识，很多软件，都只是用过，很多数据，也只是处理了一下，不知道为什么要去做，╮(╯▽╰)╭扯远了，开始谈这个软件吧！

mutsig软件是broadinstitute出品的，所以可靠性非常好咯，来源于一篇nature文章：http://www.nature.com/nature/journal/v505/n7484/full/nature12912.html，而该软件的地址是：http://www.broadinstitute.org/cancer/cga/mutsig_run 需要简单注册才能下载的。

该nature文章是这样描述这个软件的优点的：We used the most recent version of the MutSig suite of tools, which looks for three independent signals: highmutational burden relative to background expectation, accounting for heterogeneity; clustering of mutations within the gene; and enrichment of mutations in evolutionarily conserved sites. Wecombined the significance levels (P values) fromeach test to obtain a single significance level per gene (Methods).

这个软件需要安装matlab环境才能使用，所以我前面就写了教程，如何安装！http://www.bio-info-trainee.com/?p=1166

如果已经安装好了matlab环境，那么直接下载这个软件就可以使用了，软件解压就OK拉，而且人家还提供了测试文件！

软件下载后，解压可以看到里面的一个脚本，软件说明书写的非常简单，当然，使用这个软件也的确非常简单：

run_MutSigCV.sh <path_to_MCR> mutations.maf coverage.txt covariates.txt output.txt 即可，其中所有的数据都是可以下载的，

运行完了测试数据，就证明你的软件安装没有问题啦！如果你只有突变数据的maf格式，maf格式可以参考：https://www.biostars.org/p/69222/ ，也可以使用该软件：如下

run_MutSigCV.sh <path_to_MCR> my_mutations.maf exome_full192.coverage.txt gene.covariates.txt my_results mutation_type_dictionary_file.txt chr_files_hg19

上面三个zip文件，都是可以在mutsig软件官网找到下载链接的，是必须下载的！使用很简单，就一个命令即可，但是把你的vcf突变数据做成该软件需要的maf格式，是一个难题！

九 24

broad_institute收集的癌症数据

Posted on 2015年9月24日 by ulwvfje

肾上腺皮质	Adrenocortical carcinoma	ACC	92	Browse	Browse
膀胱，尿路上皮	Bladder urothelial carcinoma	BLCA	412	Browse	Browse
乳腺癌	Breast invasive carcinoma	BRCA	1098	Browse	Browse
子宫颈	Cervical and endocervical cancers	CESC	307	Browse	Browse
胆管癌	Cholangiocarcinoma	CHOL	36	Browse	Browse
结肠腺癌	Colon adenocarcinoma	COAD	460	Browse	Browse
大肠腺癌	Colorectal adenocarcinoma	COADREAD	631	Browse	Browse
淋巴肿瘤弥漫性大B细胞淋巴瘤	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	DLBC	58	Browse	Browse
食管	Esophageal carcinoma	ESCA	185	Browse	Browse
FFPE试点二期	FFPE Pilot Phase II	FPPP	38	None	Browse
胶质母细胞瘤	Glioblastoma multiforme	GBM	613	Browse	Browse
脑胶质瘤	Glioma	GBMLGG	1129	Browse	Browse
头颈部鳞状细胞癌	Head and Neck squamous cell carcinoma	HNSC	528	Browse	Browse
肾嫌色	Kidney Chromophobe	KICH	113	Browse	Browse
泛肾	Pan-kidney cohort (KICH+KIRC+KIRP)	KIPAN	973	Browse	Browse
肾透明细胞癌	Kidney renal clear cell carcinoma	KIRC	537	Browse	Browse
肾乳头细胞癌	Kidney renal papillary cell carcinoma	KIRP	323	Browse	Browse
急性髓系白血病	Acute Myeloid Leukemia	LAML	200	Browse	Browse
脑低级神经胶质瘤	Brain Lower Grade Glioma	LGG	516	Browse	Browse
肝癌	Liver hepatocellular carcinoma	LIHC	377	Browse	Browse
肺腺癌	Lung adenocarcinoma	LUAD	585	Browse	Browse
肺鳞状细胞癌	Lung squamous cell carcinoma	LUSC	504	Browse	Browse
间皮瘤	Mesothelioma	MESO	87	Browse	Browse
卵巢浆液性囊腺癌	Ovarian serous cystadenocarcinoma	OV	602	Browse	Browse
胰腺癌	Pancreatic adenocarcinoma	PAAD	185	Browse	Browse
嗜铬细胞瘤和副神经节瘤	Pheochromocytoma and Paraganglioma	PCPG	179	Browse	Browse
前列腺癌	Prostate adenocarcinoma	PRAD	499	Browse	Browse
直肠腺癌	Rectum adenocarcinoma	READ	171	Browse	Browse
肉瘤	Sarcoma	SARC	260	Browse	Browse
皮肤皮肤黑色素瘤	Skin Cutaneous Melanoma	SKCM	470	Browse	Browse
胃腺癌	Stomach adenocarcinoma	STAD	443	Browse	Browse
胃和食管癌	Stomach and Esophageal carcinoma	STES	628	Browse	Browse
睾丸生殖细胞肿瘤	Testicular Germ Cell Tumors	TGCT	150	Browse	Browse
甲状腺癌	Thyroid carcinoma	THCA	503	Browse	Browse
胸腺瘤	Thymoma	THYM	124	Browse	Browse
子宫内膜癌	Uterine Corpus Endometrial Carcinoma	UCEC	560	Browse	Browse
子宫癌肉瘤	Uterine Carcinosarcoma	UCS	57	Browse	Browse
葡萄膜黑色素瘤	Uveal Melanoma	UVM	80	Browse	Browse

看起来癌症很多呀，任重道远

九 01

2012-LAD的三个亚型的不同生物学意义

Posted on 2015年9月1日 by ulwvfje

文献名：Differential Pathogenesis of Lung Adenocarcinoma Subtypes Involving Sequence Mutations, Copy Number, Chromosomal Instability, and Methylation
Lung adenocarcinoma (LAD)的遗传变异度很大。
这个癌症可以分成三类：The LAD molecular subtypes (Bronchioid, Magnoid, and Squamoid)
然后我们在三个subtypes里面分析了以下四个特征，发现不同subtypes差异非常显著。
1、Gene mutation rates (EGFR, KRAS, STK11, TP53),
2、chromosomal instability,
3、regional copy number
4、genomewide DNA methylation
另外三个临床特征也是很显著。
1、Patient overall survival,
2、cisplatin plus vinorelbine therapy response
3、predicted gefitinib sensitivity
所以，我们的分类非常好，而且对临床非常有帮助。
对LAD的研究数据包括
1，DNA copy number
2，gene sequence mutation
3，DNA methylation
4，gene expression
即使是TP53这样的基因在LAD的突变率也才35%，所以我们的LAD应该更加细分，因为EGFR mutation and KRAS mutation这样的突变对治疗很有指导意义，细分更加有助于临床针对性治疗方案的选择。
我们选取了116个LAD样本的数据，分析了1，genome-wide gene expression,,2，genomewide DNA copy number, 3，genome-wide DNA methylation, 4，selected gene sequence mutations
得到的结论是：LAD molecular subtypes correlate with grossly distinct genomic alterations and patient therapy response
数据来源如下：
Gene expression --> Agilent 44 K microarrays.
DNA copy number --> Affymetrix 250 K Sty and SNP6 microarrays.
DNA methylation --> MSNP microarray assay.
DNA from EGFR, KRAS, STK11 and TP53 exons --> ABI sequencers
我们用的是R语言包 ConsensusClusterPlus根据gene expression 来对我们的LAD进行分类molecular subtypes
分类的基因有506个(the top 25% most variable genes, 3,045, using ConsensusClusterPlus)，A nearest centroid subtype predictor utilizing 506 genes

这三类LAD的过表达基因参与不同的生物功能，
Bronchioid – excretion genes, asthma genes, and surfactants (SFTPB, SFTPC, SFTPD);
Magnoid – DNA repair genes, such as thymine-DNA glycosylase (TDG);
Squamoid – defense response genes, such as chemokine ligand 10 (CXCL10)
而且也对应不同的临床数据
Bronchioid had the most females, nonsmokers, early stage tumors, and low grade tumors, the greatest acinar content, the least necrosis, and the least invasion.
Squamoid had the most high grade tumors, the greatest solid content, and the lowest papillary content.
Magnoid had themost smokers and the heaviest smokers by pack years.
它们的基因突变pattern也有很大区别。
Bronchioid had the greatest EGFR mutation frequency
Magnoid had the greatest mutation frequencies in TP53, KRAS and STK11.
为了研究不同亚型癌症的突变模式的不同（genomewide mutation rates），我们同时又研究了a large set of rarely mutated genes (n = 623) from the Ding et al. cohort

结论：
Bronchioid subtype 更有可能受益于EGFR inhibitory therapy
Magnoid tumors also have severe genomic alterations including the greatest CIN, the most regional CN alterations, DNA hypermethylation, and the greatest genomewide mutation rate.
the Squamoid subtype displayed the fewest distinctive alterations that included only regional CN alterations

Page 4 of 51 2 345

生信菜鸟团

欢迎去论坛biotrainee.com留言参与讨论，或者关注同名微信公众号biotrainee

Category Archives: cancer

用GISTIC多个segment文件来找SCNA变异

2016-TCGA数据挖掘系列文章之癌症男女有别

TCGA数据挖掘系列文章之-pseudogene假基因探究

所有TCGA的maf格式somatic突变数据均可下载

突变频谱探究mutation siganures

我们进入了一个癌症研究非常好的时代？

使用可视化工具MutationMapper来看看基因上面突变的分布

用TCGA数据做cox生存分析的风险因子（比例风险模型）

用RNA-SeQC得到表达矩阵RPKM值

这个软件不仅仅能做QC，而且可以统计各个基因的RPKM值！尤其是TCGA计划里面的都是用它算的

一、程序安装

TCGA数据里面都会提供由RNA-SeQC软件处理得到的表达矩阵！

Expression

对CCLE数据库可以做的分析

Evaluating cell lines as tumour models by comparison of genomic profiles

CCLE数据库几个知识点

TCGA数据里面的生存分析例子

生存分析简介

寻找somatic突变的软件大合集

Comparison of somatic mutation calling methods in amplicon and whole exome sequence data

使用oncotator做突变注释

Genomic Annotations

Protein Annotations

Cancer Variant Annotations

Non-Cancer Variant Annotations

用firehose_get 来下载所有TCGA寄存在broad的数据

做癌症研究一定要把这几十篇TCGA的大文章看完

使用mutsig软件来找驱动基因

broad_institute收集的癌症数据

2012-LAD的三个亚型的不同生物学意义

2026年2月
一	二	三	四	五	六	日
« 九
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28