Visualizing chimeric RNA with the Bioconductor package chimeraviz

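High-throughput RNA sequencing has made the detection of fusion transcripts much more efficient, but fusion-detection methods and the software built on them usually produce a high false discovery rate. A visualization framework that automatically integrates the RNA data with known genomic features helps when checking the results. chimeraviz, a Bioconductor package released in 2017, does exactly that: it creates chimeric RNA visualizations automatically.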

It supports input from nine different fusion-finding tools (deFuse, EricScript, InFusion, JAFFA, FusionCatcher, FusionMap, PRADA, SOAPfuse, and STAR-FUSION).

Fusion gene detection software: SOAPfuse

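Developer: BGI; it is part of the SOAP software suite!

Function: detects fusion genes
Strength: arguably the best performer among the tools currently available
Algorithm: it uses a hash index, which is somewhat different from the BWT-based algorithms used elsewhere
Other tools include: FusionSeq [21], deFuse [22], TopHat-Fusion [23], FusionHunter [24], SnowShoes-FTD [25], chimerascan [26] and FusionMap [27]
I have not read the algorithm in detail. I simply had a need for it: I happened to have some RNA-seq data and wanted to look at gene fusions in the samples, so I tested this software. Put plainly, the principle behind fusion gene detection is simple: if enough reads have one part aligned to one gene and the other part aligned to a different gene, that indicates the two genes have fused! With paired-end sequencing it is even easier, because the alignments of the two mates can be taken into account as well. Enough talk, let's get straight to the tutorial!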
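To make the principle above concrete, here is a minimal Python sketch. It is not SOAPfuse's actual algorithm (which I have not read); it only counts reads or read pairs whose two parts map to different genes. The function name, the input format, and the `min_support` threshold are all assumptions made up for illustration.

```python
from collections import Counter

def candidate_fusions(alignments, min_support=2):
    """Count reads (or read pairs) whose two parts hit different genes.

    alignments: iterable of (gene_a, gene_b) tuples, one per read or pair,
    where gene_a is the gene hit by the first part/mate and gene_b the gene
    hit by the second part/mate (hypothetical input format).
    min_support: assumed threshold for "enough reads".
    """
    support = Counter()
    for gene_a, gene_b in alignments:
        # Ignore unmapped parts and reads that stay within a single gene.
        if gene_a is None or gene_b is None or gene_a == gene_b:
            continue
        support[(gene_a, gene_b)] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}

# Toy data: three reads span BCR and ABL1, one read is an isolated oddity.
toy_alignments = [
    ("BCR", "ABL1"),
    ("BCR", "ABL1"),
    ("BCR", "BCR"),
    ("TP53", "ABL1"),
    ("BCR", "ABL1"),
]
fusions = candidate_fusions(toy_alignments, min_support=2)
print(fusions)
```

Real tools add many filters on top of this (mapping quality, homologous genes, breakpoint positions), which is where most of the false positives are supposed to get removed.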
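1. Software installation
Download the archive and unpack it; it is then ready to use!!!
I recommend the latest version, and read the author's manual carefully as well!
I got confused several times myself and only sorted things out after contacting the author. He said he wants to move to version 2.0, which would work directly from HISAT alignment SAM files, but that is still in preparation, and I have my doubts about it!
After unpacking you get a pile of Perl programs, all under the source directory. Under source there is also a bin directory with several bundled third-party tools, including bwa, blast and soap, all of which are needed later!
One very important point: the Perl modules shipped with the software must be added to Perl's library path (environment variable). Otherwise the Perl programs will fail with errors!
The config file needs to be edited; you just put a few directories into it.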
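As for the Perl module issue mentioned above, one way to handle it is to prepend the module directory to `PERL5LIB`, the standard environment variable Perl searches for libraries. Below is a small Python sketch of building such an environment before launching the Perl scripts. The module directory shown is a made-up placeholder; check the unpacked archive for where the bundled modules actually live.

```python
import os

def env_with_perl_lib(module_dir, base_env=None):
    """Return a copy of the environment with module_dir prepended to PERL5LIB."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PERL5LIB", "")
    # Keep any existing entries, but let the bundled modules take priority.
    env["PERL5LIB"] = module_dir if not existing else module_dir + os.pathsep + existing
    return env

# Hypothetical location of the bundled Perl modules; adjust to your install.
perl_env = env_with_perl_lib("/path/to/SOAPfuse-v1.27/perl_modules", base_env={})
print(perl_env["PERL5LIB"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when calling the SOAPfuse Perl scripts from Python.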

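2. Preparing the input data

The most important part here is building the database!!!
The author gives a very detailed procedure, but I still did not find it clear enough, so I will go through it once more!
First download these five files: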
6.5K Jun 15  2009 cytoBand.txt.gz
3.0G Oct 12  2012 hg19.fa
2.5M Mar 15 10:30 HGNC_Gene_Family_dataset
38M Feb  8  2014 Homo_sapiens.GRCh37.75.gtf.gz
202 Jan 19 16:07 HumanRef_refseg_symbols_relationship.list

The author already provides the download links for these files!

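I put all of these files in a subfolder called raw under the current folder, because I want the current folder to serve as the software's database folder!!!
Then run the command!
I ran it from that database folder (db_v1.27 in my case), which sits next to the SOAPfuse-v1.27 folder: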
perl ../SOAPfuse-v1.27/source/SOAPfuse-S00-Generate_SOAPfuse_database.pl  \
-wg raw/hg19.fa  -gtf raw/Homo_sapiens.GRCh37.75.gtf.gz  -cbd raw/cytoBand.txt.gz   -gf raw/HGNC_Gene_Family_dataset \
-rft raw/HumanRef_refseg_symbols_relationship.list \
 -sd ../SOAPfuse-v1.27 -dd ./

This step takes a long time, 4 to 6 hours. It generates transcript.fa and gene.fa and then builds bwa and soap indexes for them, which is why it is slow!

When the build succeeds you get this message:
Congratulations!
You have constructed SOAPfuse database files successfully.
These database files are all stored in directory you supplied:
/home/jmzeng/biosoft/SOAPfuse/db_v1.27/
They are all generated based on public data files you supplied:
whole_genome_fasta_file:   /home/jmzeng/biosoft/SOAPfuse/db_v1.27/raw/hg19.fa
gtf_annotation_file:       /home/jmzeng/biosoft/SOAPfuse/db_v1.27/raw/Homo_sapiens.GRCh37.75.gtf.gz
Chr_Bandregion_file:       /home/jmzeng/biosoft/SOAPfuse/db_v1.27/raw/cytoBand.txt.gz
HGNC_gene_family_file:     /home/jmzeng/biosoft/SOAPfuse/db_v1.27/raw/HGNC_Gene_Family_dataset
gtf_segname2refseg_list:   /home/jmzeng/biosoft/SOAPfuse/db_v1.27/raw/HumanRef_refseg_symbols_relationship.list
(These directories matter: you will need them when preparing the config file next!)
To use these database files, just set the 'DB_db_dir' in config file as belowed:
DB_db_dir  =   /home/jmzeng/biosoft/SOAPfuse/db_v1.27
The following five entries in the config file need to be changed:
DB_db_dir = /DATABASE_DIR/
PG_pg_dir = /TOOL_DIR/source/bin
PS_ps_dir = /TOOL_DIR/source
PD_all_out = /out_directory/
PA_all_fq_postfix = PostFix
If you read the manual carefully, you will know what to change them to!
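If you would rather script the edit than do it by hand, the sketch below rewrites the five entries listed above in a config template. Only the `DB_db_dir` value comes from the database build output; the tool directory, the output directory, and the `fq.gz` postfix are placeholders or guesses of mine, so check them against the manual.

```python
import re

TEMPLATE = """\
DB_db_dir = /DATABASE_DIR/
PG_pg_dir = /TOOL_DIR/source/bin
PS_ps_dir = /TOOL_DIR/source
PD_all_out = /out_directory/
PA_all_fq_postfix = PostFix
"""

def set_config_values(config_text, updates):
    """Rewrite `KEY = value` lines for every key in updates; fail on missing keys."""
    remaining = dict(updates)
    out_lines = []
    for line in config_text.splitlines():
        match = re.match(r"^\s*([A-Za-z0-9_]+)\s*=", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out_lines.append(f"{key} = {remaining.pop(key)}")
        else:
            out_lines.append(line)
    if remaining:
        raise KeyError(f"keys not found in config: {sorted(remaining)}")
    return "\n".join(out_lines) + "\n"

tool_dir = "/path/to/SOAPfuse-v1.27"  # hypothetical install location
updated = set_config_values(TEMPLATE, {
    "DB_db_dir": "/home/jmzeng/biosoft/SOAPfuse/db_v1.27",  # from the build output
    "PG_pg_dir": tool_dir + "/source/bin",
    "PS_ps_dir": tool_dir + "/source",
    "PD_all_out": "/path/to/output",   # hypothetical output directory
    "PA_all_fq_postfix": "fq.gz",      # guess based on the test_1.fq.gz file names
})
print(updated)
```

Raising an error for keys that are not found is deliberate: a silently ignored typo in a key name would leave the template placeholder in place.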
Finally, create the sample list file.
I only have one sample here, so the file is a single line:
test test test 100
So I ended up with the two files below. I only created something as silly as test/test/test to fit the layout the author requires!!!
/home/jmzeng/test_for_soapfuse/test/test/test_1.fq.gz
/home/jmzeng/test_for_soapfuse/test/test/test_2.fq.gz
If you need to run several samples together, read the author's README carefully; it makes this configuration file really complicated!!!
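The relationship between the sample list line and the two FASTQ paths above can be sketched as follows. I am assuming that the four columns are sample ID, library ID, run ID and read length, and that the files are expected at `<data dir>/<sample>/<library>/<run>_1.<postfix>` and `_2.<postfix>`. That layout is only inferred from my single-sample example, so treat the README as the authority.

```python
import posixpath

def expected_fastq_paths(seq_data_dir, sample_list_line, postfix="fq.gz"):
    """Derive the paired FASTQ paths implied by one sample list line.

    Assumed columns: sample ID, library ID, run ID, read length.
    """
    sample_id, lib_id, run_id, read_length = sample_list_line.split()
    run_dir = posixpath.join(seq_data_dir, sample_id, lib_id)
    # One file per mate of the paired-end run.
    return [posixpath.join(run_dir, f"{run_id}_{mate}.{postfix}") for mate in (1, 2)]

paths = expected_fastq_paths("/home/jmzeng/test_for_soapfuse", "test test test 100")
for path in paths:
    print(path)
```

A check like this before launching the pipeline is handy for catching a misplaced file early, given how long a run takes.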

3. Running the command

Once all the files are ready, running it is very simple!!
perl SOAPfuse-RUN.pl -c <config_file> -fd <WHOLE_SEQ-DATA_DIR> -l <sample_list> -o <out_directory> [Options]
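If you launch the run from a script, the command above can be assembled as an argument list, which avoids shell quoting problems with odd paths. This sketch only uses the four options shown in the usage line (-c, -fd, -l, -o); the config, sample list and output file names are placeholders I made up.

```python
import shlex

def build_soapfuse_command(run_script, config_file, seq_data_dir, sample_list, out_dir):
    """Assemble the SOAPfuse-RUN.pl call as an argv list for subprocess."""
    return [
        "perl", run_script,
        "-c", config_file,
        "-fd", seq_data_dir,
        "-l", sample_list,
        "-o", out_dir,
    ]

cmd = build_soapfuse_command(
    "SOAPfuse-RUN.pl",
    "config.txt",                      # hypothetical config file name
    "/home/jmzeng/test_for_soapfuse",  # directory holding test/test/test_1.fq.gz
    "sample_list.txt",                 # hypothetical sample list file name
    "out",                             # hypothetical output directory
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```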

It runs very slowly!!!

That is because the reads have to be realigned.

4. Interpreting the results

The author has already explained the results very clearly, so I won't say more!

You can actually sell TCGA data, as long as you analyze it a little

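This was a real eye-opener for me: apparently you can make money this way too.
In 2011 the authors of this database published a paper on how to find fusion genes: Edgren, Henrik, et al. "Identification of fusion genes in breast cancer by paired-end RNA-sequencing." Genome Biol 12.1 (2011): R6.

Building on that, they processed all the cancer sample data from the TCGA project, obtained a fusion gene dataset, and now sell it.

The price goes up to about ten thousand euros, more than 70,000 RMB. It is practically pure profit, and the TCGA data are public and free; they only did some secondary processing and can make money from it, which really annoys me.
So far they have processed data from 7625 cancer samples in the TCGA project, built a fusion gene dataset covering 28 cancer types, and packaged it as a product called FusionSCOUT for sale.
The prices are as follows: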

Pricing of FusionSCOUT datasets:

  • Single gene in one cancer set: 490€ / 580$ per dataset
  • Single gene fusions across all cancers: 4900€ / 5800$ per dataset
  • Individual cancer set: 990€ / 1250$ per dataset
  • Full TCGA dataset: 9900€ / 12500$ per dataset
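Below is how the website describes the product. They claim that 3500 research groups have already used their data, but I think that is pure bragging. After all, the paper has only a hundred-odd citations, and 3500 purchases of this one product alone would make the seller a multimillionaire, which is scary just to think about. Besides, the website is terrible and painfully slow to access from China, which effectively means losing all the wealthy Chinese customers, so how could they possibly have 3500 sales? Ridiculous!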

One of the latest therapeutics angles in the fight against cancer is fusion genes and their regulation. To aid in fusion gene research and reveal the multitude of gene fusion event in cancer samples MediSapiens has developed a proprietary FusionSCOUT pipeline for identifying fusion genes from RNA sequencing datasets.

Currently we have analysed 7625 tumour samples from the TCGA project building a fusion gene dataset covering 28 different cancers within the TCGA project which can be accessed through our FusionSCOUT product.

Using this pipeline, we have discovered 3930 samples with gene fusions with 9667 different fusion genes. We've discovered numerous novel gene fusions as well as new cancer types in which previously known fusions appear.

You can now purchase these gene fusions datasets with a few mouse clicks and get the world's most comprehensive gene fusions from cancer sets within days.

FusionSCOUT cancer Reports

With FusionSCOUT you can access the full listings of all fusion genes in specific cancer datasets. Find new leads for possible cause of the cancer, examine the pathways that are affected by different fusions, stratify patients by shared fusion genes or search for potential target for drugs and companion diagnostics.

Once you purchase a FusionSCOUT dataset we will send you a detailed report with information on the fused genes, sample ID from the TCGA dataset, fusion frequencies across the dataset as well as fusion mRNA sequences and lists of protein domains present in the fusion transcripts.

By ordering the MediSapiens FusionSCOUT dataset, you'll get:

  • A list of all gene fusions that involve your gene of interest, across all TCGA cancer types
  • TCGA sample IDs for the samples with fusions
  • Exact exon junctions for the fusions, including alternatively spliced variants and data on whether reading frame is retained
  • Detailed list of protein domains retained in the fusion genes
  • cDNA sequence for the fusion mRNAs

Contact us to access the most up-to-date and comprehensive datasets of fusion gene events in different cancers! contact@medisapiens.com

Check out also our Fusion Gene Detection pipeline service for your samples!

Dataset missing? Email us and we'll add your favorite dataset to FusionSCOUT!

FusionSCOUT Cancer sets, March 2015

Cancer type | Number of samples | Number of fusion genes
Acute Myeloid Leukemia, LAML | 153 | 69
Adrenocortical carcinoma, ACC | 79 | 115
Bladder Urothelial Carcinoma, BLCA | 273 | 473
Brain Lower Grade Glioma, LGG | 467 | 309
Breast Invasive Carcinoma, BRCA | 1029 | 3267
Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma, CESC | 195 | 190
Colon Adenocarcinoma, COAD | 287 | 212
Glioblastoma multiforme, GBM | 170 | 379
Head and Neck Squamous Cell Carcinoma, HNSC | 412 | 386
Kidney Chromophobe, KICH | 66 | 19
Kidney Renal Clear Cell Carcinoma, KIRC | 523 | 217
Kidney Renal Papillary Cell Carcinoma, KIRP | 226 | 145
Liver Hepatocellular Carcinoma, LIHC | 198 | 317
Lung Adenocarcinoma, LUAD | 456 | 991
Lung Squamous Cell Carcinoma, LUSC | 482 | 1374
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma, DLBC | 28 | 18
Mesothelioma, MESO | 36 | 26
Ovarian Serous Cystadenocarcinoma, OV | 420 | 1166
Pancreatic Adenocarcinoma, PAAD | 84 | 46
Pheochromocytoma and Paraganglioma, PCPG | 184 | 83
Prostate Adenocarcinoma, PRAD | 336 | 859
Rectum Adenocarcinoma, READ | 85 | 74
Sarcoma, SARC | 161 | 799
Skin Cutaneous Melanoma, SKCM | 355 | 620
Stomach Adenocarcinoma, STAD | 190 | 311
Thyroid Carcinoma, THCA | 506 | 195
Uterine Carcinosarcoma, UCS | 57 | 229
Uterine Corpus Endometrial Carcinoma, UCEC | 167 | 422
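
As a quick sanity check on the table above, the per-cancer sample counts can be added up and compared with the 7625 tumour samples and 28 cancers quoted in the vendor's description. The snippet below just re-enters the abbreviation, sample count and fusion gene count from each row; nothing else is assumed.

```python
# Abbreviation, number of samples, number of fusion genes (copied from the table).
ROWS = """\
LAML 153 69
ACC 79 115
BLCA 273 473
LGG 467 309
BRCA 1029 3267
CESC 195 190
COAD 287 212
GBM 170 379
HNSC 412 386
KICH 66 19
KIRC 523 217
KIRP 226 145
LIHC 198 317
LUAD 456 991
LUSC 482 1374
DLBC 28 18
MESO 36 26
OV 420 1166
PAAD 84 46
PCPG 184 83
PRAD 336 859
READ 85 74
SARC 161 799
SKCM 355 620
STAD 190 311
THCA 506 195
UCS 57 229
UCEC 167 422
"""

records = []
for line in ROWS.splitlines():
    abbreviation, samples, fusion_genes = line.split()
    records.append((abbreviation, int(samples), int(fusion_genes)))

total_samples = sum(samples for _, samples, _ in records)
print(len(records), "cancer types,", total_samples, "samples in total")
```

The totals agree with the figures in the description, so the table and the text are at least internally consistent.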