25

自学miRNA-seq分析第三讲~公共测序数据下载

前面已经讲到了该文章的数据已经上传到NCBI的SRA数据中心,所以直接根据索引号下载,然后用SRAtoolkit转出我们想要的fastq测序数据即可。下载的数据一般要进行质量控制,可视化展现一下质量如何,然后根据大题测序质量进行简单过滤。所以需要提前安装一些软件来完成这些任务,包括: sratoolkit /fastx_toolkit /fastqc/bowtie2/hg19/miRBase/SHRiMP

下面是我用新服务器下载安装软件的一些代码记录,因为fastx_toolkit /fastqc我已经安装过,就不列代码了,还有miRBase的下载,我在前面第二讲里面提到过,传送门:自学miRNA-seq分析第二讲~学习资料的搜集 Continue reading

05

用snpEFF对vcf格式的突变数据进行注释

这个软件比较重要,尤其是对做遗传变异相关研究的,很多人做完了snp-calling后喜欢用ANNOVAR来进行注释,但是那个注释还是相对比较简单,只能得到该突变位点在基因的哪个区域,那个基因这样的信息,如果想了解更具体一点,就需要更加功能化的软件了,snpEFF就是其中的佼佼者,而且是java平台软件,非常容易使用!而且它的手册写的非常详细:http://snpeff.sourceforge.net/SnpEff_manual.html Continue reading

12

R包精讲第二篇:如何安装旧版本的包?

既然你点进来看,肯定是有需求咯!
一般来说,R语言自带的install.packages函数来安装一个包时,都是默认安装最新版的。
但是有些R包的开发者他会引用其它的一些R包,但是它用的是人家旧版本的功能,但他自己来不及更新或者疏忽了。
而我们又不得不用他的包,这时候就不得不卸载最新版包,转而安装旧版本包。

Continue reading

15

基因组各种版本对应关系

我是受到了SOAPfuse的启发才想到整理各种基因组版本的对应关系,完整版!!!
以后再也不用担心各种基因组版本混乱了,我还特意把所有的下载链接都找到了,可以下载任意版本基因组的基因fasta文件,gtf注释文件等等!!!
首先是NCBI对应UCSC,对应ENSEMBL数据库:
GRCh36 (hg18): ENSEMBL release_52.
GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
GRCh38 (hg38): ENSEMBL  release_76/77/78/80/81/82.
可以看到ENSEMBL的版本特别复杂!!!很容易搞混!
但是UCSC的版本就简单了,就hg18,19,38, 常用的是hg19,但是我推荐大家都转为hg38
看起来NCBI也是很简单,就GRCh36,37,38,但是里面水也很深!
Feb 13 2014 00:00    Directory April_14_2003
Apr 06 2006 00:00    Directory BUILD.33
Apr 06 2006 00:00    Directory BUILD.34.1
Apr 06 2006 00:00    Directory BUILD.34.2
Apr 06 2006 00:00    Directory BUILD.34.3
Apr 06 2006 00:00    Directory BUILD.35.1
Aug 03 2009 00:00    Directory BUILD.36.1
Aug 03 2009 00:00    Directory BUILD.36.2
Sep 04 2012 00:00    Directory BUILD.36.3
Jun 30 2011 00:00    Directory BUILD.37.1
Sep 07 2011 00:00    Directory BUILD.37.2
Dec 12 2012 00:00    Directory BUILD.37.3
可以看到,有37.1,   37.2,  37.3 等等,不过这种版本一般指的是注释在更新,基因组序列一般不会更新!!!
反正你记住hg19基因组大小是3G,压缩后八九百兆即可!!!
如果要下载GTF注释文件,基因组版本尤为重要!!!
对于ensembl:
变幻中间的release就可以拿到所有版本信息:ftp://ftp.ensembl.org/pub/
对于UCSC,那就有点麻烦了:
需要选择一系列参数:
2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser
3. Click 'get output'.
 现在重点来了,搞清楚版本关系了,就要下载呀!
UCSC里面下载非常方便,只需要根据基因组简称来拼接url即可:
或者用shell脚本指定下载的染色体号:
for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;
## 这里也可以用NCBI的:ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/chr前缀
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
rm -fr chr*.fasta
12

bioconductor中文社区招募站长

我已经构建好bioconductor中文社区的雏形,大家可以进去看看!

写在前面

突然发现我的bioconductor.cn这个域名都快要过期了!

哈哈,才想起一年前的计划到现在还没开始实施,实在不像我的风格,可能是水平到了一定程度吧,很多初级工作不像以前那样事无巨细的把关了。正好,借这个机会找几个朋友帮我一起完成这个bioconductor中文社区计划!

 

Continue reading

18

生物信息学分析过程中常见文件格式

刚开始接触生物信息学的时候我也很纠结什么fastq,fastq,sam,bam,vcf,maf,gtf,bed,psl等等,甚至还有过时了的NCBI,ENSEMBL格式,如果是我刚开始 学的时候,我倒是很愿意把他们全部搞透彻,写详细的说明书,但是现在成长了,这些东西感觉很low了,正好我看到了一篇帖子讲数据格式的收集大全,分享给大家,希望初学者能多花点时间好好钻研!

https://www.biostars.org/p/55351/

每种文件格式的定义,都是有它的道理的,大部分是因为一个比较流行的软件,少量的数据格式是因为国际组织广泛认可而流行的
27

gene的symbol与entrez ID并不是绝对的一一对应的

很多时候,我们都无法确定到底是基于symbol来进行分析,还是基于entrez ID,当我们要进行ID转换的时候也想当然的以为它们的一一对应的, 但是最近我写了一个脚本来分析CCLE的数据的时候,发现其实有一些特例:

suppressMessages(library(org.Hs.eg.db))

all_symbol=mappedkeys(org.Hs.egSYMBOL2EG)

all_EGID =mappedkeys(org.Hs.egSYMBOL)
tmp=as.list(org.Hs.egSYMBOL2EG[all_symbol])
#tmp=as.list(org.Hs.egSYMBOL[all_EGID ])
tmp=unlist(lapply(tmp,length))
tmp=tmp[tmp>1]
as.list(org.Hs.egSYMBOL2EG[names(tmp)])
有多个entrez ID对应一个symbol的现象出现,但是没有一个symbol对应多个entrez ID的现象。而且entrez ID也会过期!
$CSNK1E
[1] "1454"      "102800317"
$HBD
[1] "3045"      "100187828"
$RNR1
[1] "4549" "6052"
$RNR2
[1] "4550" "6053"
$SFPQ
[1] "6421"   "654780"
$TEC
[1] "7006"      "100124696"
$MEMO1
[1] "7795"  "51072"
$KIR3DL3
[1] "115653"    "100133046"
$MMD2
[1] "221938"    "100505381"
$`LSAMP-AS1`
[1] "100506708" "101926903"
通过下面的链接可以看到具体情况
14

用broad出品的软件来处理bam文件几次遇到文件头错误

报错如下:ERROR MESSAGE: SAM/BAM file input.marked.bam is malformed: SAM file doesn't have any read groups defined in the header.  The GATK no longer supports SAM files without read groups !

有些人遇到的是bam的染色体顺序不一样,还有可能是染色体的名字不一样,比如>1和>chr1的区别,虽然很傻,但是遇到这样问题的还不少!
还有一些人是遇到基因组没有dict文件,也是用picard处理一下就好。

大部分人是在GATK遇到的,我是在RNA-SeQC遇到的,不过原理都是一样的。
都是因为做alignment的时候并未添加头信息,比如:
bwa samse ref.fa my.sai my.fastq > my.sam
samtools view -bS my.sam > my.bam
samtools sort my.bam my_sorted
java -jar ReordereSam.jar I=/path/my_sorted.bam O=/path/my_reordered.bam R=/path/ref.fa
通过这个代码可以得到排序好的bam,但是接下来用GATK就会报错
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /paht/ref.fa -I /path/aln_reordered.bam
就是因为没有头信息,group相关信息,解决方法有两种:
bwa samse -r @RG\tID:IDa\tSM:SM\tPL:Illumina ref.fa my.sai my.fastq > my.sam
java -jar AddOrReplaceReadGroups I=my.bam O=myGr.bam LB=whatever PL=illumina PU=whatever SM=whatever
一种是比对的时候就加入头信息,这个需要比对工具的支持。
第二种是用picard工具来修改bam,推荐用这个!虽然我其实并不懂这些头文件信息是干嘛的, 但是broad开发的软件就是需要!希望将来去读PHD能系统性的学习一些基础知识!

 

11

关于芯片平台GPL15308和GPL570

它们虽然被GEO数据标记了不同的ID号,但是其实都是一种芯片,都是昂飞公司的U133++2芯片,分析过芯片数据的人肯定不会陌生了

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL15308

事实上,这个平台应该是GPL570,但是被CCLE数据库给稍微变通了一下,就给了一个GPL15308的标签,平台主页也写的很清楚,它的探针ID是伪ID,其实就是entrez gene ID

1

本来这个芯片设计的是五万多个探针,最后只剩下了18926个基因
This array is identical to GPL570 but the data were analyzed with a custom CDF Brainarray Version 15, hgu133plus2hsentrezg.
十二 30

用GSEA来做基因集富集分析

how to use GSEA?
这个有点类似于pathway(GO,KEGG等)的富集分析,区别在于gene set(矫正好的基于文献的数据库)的概念更广泛一点,包括了

how to download GSEA ?

what's the input for the GSEA?

说明书上写的输入数据是:GSEA supported data files are simply tab delimited ASCII text files, which have special file extensions that identify them. For example, expression data usually has the extension *.gct, phenotypes *.cls, gene sets *.gmt, and chip annotations *.chip. Click the More on file formats help button to view detailed descriptions of all the data file formats.
实际上没那么复杂,一个表达矩阵即可!然后做一个分组说明的cls文件即可。
主要是自己看说明书,做出要求的数据格式:http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
表达矩阵我这里下载GSE1009数据集做测试吧!
cls的样本说明文件,就随便搞一搞吧,下面这个是例子:
6 2 1
# good bad
good good good bad bad bad
文件如下,六个样本,根据探针来的表达数据,分组前后各三个一组。
clipboard
现在开始运行GSEA!

start to run the GSEA !

首先载入数据
clipboard
确定无误,就开始运行,运行需要设置一定的参数!
clipboard

what's the output ?

输出的数据非常多,对你选择的gene set数据集里面的每个set都会分析看看是否符合富集的标准,富集就出来一个报告。
点击success就能进入报告主页,里面的链接可以进入任意一个分报告。
最大的特色是提供了大量的数据集:You can browse the MSigDB from the Molecular Signatures Database page of the GSEA web site or the Browse MSigDB page of the GSEA application. To browse the MSigDB from the GSEA application:
 
有些文献是基于GSEA的:

 

十二 22

根据chrY独有区域的覆盖度及测序深度来判断性别

这个也是基于bam文件来的,判断chrY独有区域的覆盖度及测序深度
首先下载chrY独有区域的记录文件,https://www.familytreedna.com/documents/bigy_targets.txt
然后用samtools depth来统计测序深度,samtools  depth $i |grep 'chr[XY]'
depth统计结果文件如下:
mzeng@ubuntu:/home/jmzeng/gender_determination$ head Sample3.depth
chrX    60085    1
chrX    60086    1
chrX    60087    1
chrX    60088    1
chrX    60089    1
chrX    60090    1
chrX    60091    1
chrX    60092    1
chrX    60093    1
chrX    60094    1
然后我随便写了一个脚本来对测序深度文件进行再统计,统计覆盖度及测序深度

[perl]
open FH,"bigy_targets.txt";
while(<FH>){
 chomp;
 @F=split;
 $all+=$F[2]-$F[1]+1;
 foreach ($F[1]..$F[2]){
  $h{$_}=1;
 }
}
close FH;
open FH,$ARGV[0];
while(<FH>){
 chomp;
 @F=split;
 next unless $F[0] eq 'chrY';
 if (exists $h{$F[1]}){
  $pos++;
  $depth+=$F[2];
 }
}
close FH;
$average=$depth/$pos;
$coverage=$pos/$all;
print "$pos\t$average\t$coverage\n" ;

[/perl]

 

 
这样对那三个样本结果如下:
clipboard
可以看到只有sample4,是覆盖率极低的,而且记录到的pos位点也特别少,所以她是女性!
这里测序深度没有意义。
十一 05

点突变详解

DNA分子中某一个碱基为另一种碱基置换,导致DNA碱基序列异常,是基因突变的一种类型。可分为转换和颠换两类。转换(transitions)是同类碱基的置换(AT→GCGC→AT,颠换(transversions) 是不同类碱基的置换(AT→TACG,GC→CGTA

DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A  G) or of one-ring pyrimidines (C  T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.

我们在分析driver mutation的时候会区分各种点突变:

  • 1. CpG transitions
  • 2. CpG transversions
  • 3. C:G transitions
  • 4. C:G transversions
  • 5. A:T transitions
  • 6. A:T transversions

那么,我们有64种密码子,每种密码子都会有9种突变可能,我们如何得到一个所有的突变可能的分类并且打分表格呢?

类似于下面这样的表格:共576行!!!

head category.acgt
AAA>AAT 2 A T 6
AAA>AAC 2 A C 6
AAA>AAG 2 A G 5
AAA>ATA 1 A T 6
AAA>ACA 1 A C 6
AAA>AGA 1 A G 5
AAA>TAA 0 A T 6
AAA>CAA 0 A C 6
AAA>GAA 0 A G 5
AAT>AAA 2 T A 6

tail category.acgt
GGC>GGG 2 C G 2
GGG>AGG 0 G A 3
GGG>TGG 0 G T 4
GGG>CGG 0 G C 4
GGG>GAG 1 G A 3
GGG>GTG 1 G T 4
GGG>GCG 1 G C 4
GGG>GGA 2 G A 3
GGG>GGT 2 G T 4
GGG>GGC 2 G C 4

我本来以为这是一件很简单的事情,写起来,才发现好麻烦

1Capture

 

里面用到的一个函数如下:就是判断突变属于上述六种的哪一种!

2

参考:https://www.mun.ca/biology/scarr/Transitions_vs_Transversions.html

https://en.wikipedia.org/wiki/Transversion

http://www.uvm.edu/~cgep/Education/Mutations.html

突变,也称作单碱基替换(single base substitution),指由单个碱基改变发生的突变

可以分为转换(transitions)和颠换(transversions)两类。

转换:嘌呤和嘌呤之间的替换,或嘧啶和嘧啶之间的替换。

颠换:嘌呤和嘧啶之间的替换。

方便理解下面再附上一张示意图,如下:

 

Transitions_vs_Transversions

 

十一 01

WES(七)看de novo变异情况

de novo变异寻找这个也属于snp-calling的一部分,但是有点不同的就是该软件考虑了一家三口的测序文件,找de novo突变

软件有介绍这个功能:http://varscan.sourceforge.net/trio-calling-de-novo-mutations.html

而且还专门有一篇文章讲ASD和autism与de novo变异的关系,但是文章不清不楚的,没什么意思

Trio Calling for de novo Mutations

image001

Min coverage:   10

Min reads2:     4

Min var freq:   0.2

Min avg qual:   15

P-value thresh: 0.05

Adj. min reads2:        2

Adj. var freq:  0.05

Adj. p-value:   0.15

Reading input from trio.filter.mpileup

1371416525 bases in pileup file (137M的序列)

83123183 met the coverage requirement of 10 (其中有83M的测序深度大于10X)

145104 variant positions (132268 SNP, 12836 indel) (共发现15.5万的变异位点)

4403 were failed by the strand-filter

139153 variant positions reported (126762 SNP, 12391 indel)

502 de novo mutations reported (376 SNP, 126 indel) (真正属于 de novo mutations只有502个)

1734 initially DeNovo were re-called Germline

12 initially DeNovo were re-called MIE

3 initially DeNovo were re-called MultAlleles

522 initially MIE were re-called Germline

1 initially MIE were re-called MultAlleles

3851 initially Untransmitted were re-called Germline

然后我看了看输出的文件trio.mpileup.output.snp.vcf

软件是这样解释的:The output of the trio subcommand is a single VCF in which all variants are classified as germline (transmitted or untransmitted), de novo, or MIE.

  • FILTER - mendelError if MIE, otherwise PASS
  • STATUS - 1=untransmitted, 2=transmitted, 3=denovo, 4=MIE
  • DENOVO - if present, indicates a high-confidence de novo mutation call

里面的信息量好还是不清楚

我首先对我们拿到的trio.de_novo.mutaion.snp.vcf文件进行简化,只看基因型!
head status.txt   (顺序是dad,mom,child)
STATUS=2 0/0 0/1 0/1
STATUS=2 1/1 1/1 1/1
STATUS=2 0/1 0/0 0/1
STATUS=2 1/1 1/1 1/1
STATUS=1 0/1 0/0 0/0
STATUS=1 0/1 0/0 0/0
STATUS=2 1/1 1/1 1/1
STATUS=2 1/1 1/1 1/1
STATUS=2 1/1 1/1 1/1
STATUS=2 0/1 0/1 0/1
那么总结如下:
  26564 STATUS=1 无所以无 (0/0 0/1 0/0或者 0/1 0/0 0/0等等)
  97764 STATUS=2 有所以有 (1/1 1/1 1/1 或者0/1 0/1 1/1等等)
    385 STATUS=3 无中生有 (0/0 0/0 0/1 或者0/0 0/0 1/1)
   1485 STATUS=4 有中生无 (1/1 0/1 0/0 等等)

我用annovar注释了一下

/home/jmzeng/bio-soft/annovar/convert2annovar.pl -format vcf4  trio.mpileup.output.snp.vcf > trio.snp.annovar

/home/jmzeng/bio-soft/annovar/annotate_variation.pl -buildver hg19  --geneanno --outfile  trio.snp.anno trio.snp.annovar /home/jmzeng/bio-soft/annovar/humandb

结果是:

A total of 132268 locus in VCF file passed QC threshold, representing 132809 SNPs (86633 transitions and 46176 transversions) and 3 indels/substitutions

可以看到最后被注释到外显子上面的突变有两万多个

23794  284345 3123333 trio.snp.anno.exonic_variant_function

这个应该是非常有意义的,但是我还没学会后面的分析。只能先做到这里了

 

28

模拟Y染色体测序判断,并比对到X染色体上面,看同源性

首先下载两条染色体序列

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrX.fa.gz;

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrY.fa.gz;

152M Mar 21  2009 chrX.fa

58M Mar 21  2009 chrY.fa

然后把X染色体构建bwa的索引

bwa index chrX.fa

[bwa_index] Pack FASTA... 1.97 sec

[bwa_index] Construct BWT for the packed sequence...

[BWTIncCreate] textLength=310541120, availableWord=33850812

[BWTIncConstructFromPacked] 10 iterations done. 55838672 characters processed.

[BWTIncConstructFromPacked] 20 iterations done. 103157920 characters processed.

[BWTIncConstructFromPacked] 30 iterations done. 145211344 characters processed.

[BWTIncConstructFromPacked] 40 iterations done. 182584528 characters processed.

[BWTIncConstructFromPacked] 50 iterations done. 215797872 characters processed.

[BWTIncConstructFromPacked] 60 iterations done. 245313968 characters processed.

[BWTIncConstructFromPacked] 70 iterations done. 271543920 characters processed.

[BWTIncConstructFromPacked] 80 iterations done. 294853104 characters processed.

[bwt_gen] Finished constructing BWT in 88 iterations.

[bwa_index] 98.58 seconds elapse.

[bwa_index] Update BWT... 0.96 sec

[bwa_index] Pack forward-only FASTA... 0.91 sec

[bwa_index] Construct SA from BWT and Occ... 33.18 sec

[main] Version: 0.7.8-r455

[main] CMD: /lrlhps/apps/bioinfo/bwa/bwa-0.7.8/bwa index chrX.fa

[main] Real time: 141.623 sec; CPU: 135.605 sec

由于X染色体也就152M,所以很快,两分钟解决战斗!

然后模拟Y染色体的测序判断(PE100insert400

209M Oct 28 10:19 read1.fa

209M Oct 28 10:19 read2.fa

模拟的程序很简单

tmp

 

while(<>){
chomp;
$chrY.=uc $_;
}
$j=0;
open FH_L,">read1.fa";
open FH_R,">read2.fa";
foreach (1..4){
for ($i=600;$i<(length($chrY)-600);$i = $i+50+int(rand(10))){
$up = substr($chrY,$i,100);
$down=substr($chrY,$i+400,100);
next unless $up=~/[ATCG]/;
next unless $down=~/[ATCG]/;
$down=reverse $down;
$down=~tr/ATCG/TAGC/;
$j++;
print FH_L ">read_$j/1\n";
print FH_L "$up\n";
print FH_R ">read_$j/2\n";
print FH_R "$down\n";
}
}
close FH_L;
close FH_R;

然后用bwa mem 来比对

bwa mem -t 12 -M chrX.fa read*.fa >read.sam

用了12个线层,所以也非常快

[main] Version: 0.7.8-r455

[main] CMD: /apps/bioinfo/bwa/bwa-0.7.8/bwa mem -t 12 -M chrX.fa read1.fa read2.fa

[main] Real time: 136.641 sec; CPU: 1525.360 sec

643M Oct 28 10:24 read.sam

然后统计比对结果

samtools view -bS read.sam >read.bam

158M Oct 28 10:26 read.bam

samtools flagstat read.bam

3801483 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

2153410 + 0 mapped (56.65%:-nan%)

3801483 + 0 paired in sequencing

1900666 + 0 read1

1900817 + 0 read2

645876 + 0 properly paired (16.99%:-nan%)

1780930 + 0 with itself and mate mapped

372480 + 0 singletons (9.80%:-nan%)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

我自己看sam文件也发现真的同源性好高呀,总共就模拟了380万reads,就有120万是百分百比对上了。

所以对女性个体来说,测序判断比对到Y染色体是再正常不过的了。如果要判断性别,必须要找那些X,Y差异性区段

21

一个MIT的博士要离开学术圈,结果引发了上千人的热烈讨论(下)

Dr Petra EichelsdoerferFebruary 17, 2014 at 11:17 AM

My university lacked the funds to support me while applying for my first grant after my post-doc. So I was unable to re-write it after its first rejection. Although it was heartbreaking to walk away, I had a family to support. This was nearly four years ago. How many others have had to make the same decision I have? What will be the long-term consequences?

我们大学也是资金紧张,所以我直到博士后出站才拿到资金的第一笔科研基金。大约四年前把,当我第一次被基金会拒绝后,我实在难以鼓起勇气再次申请基金。尽管我必须去申请,因为我还得养家糊口。会有多少人曾与我经历过同样的处境呢,又有多少人与我一样做了同样的选择呢? Continue reading

20

一个MIT的博士要离开学术圈,结果引发了上千人的热烈讨论(中)

的确是上千人进行讨论,我这里分两次翻译网友们的讨论结果:

Lenny TeytelmanFebruary 15, 2014 at 7:53 AM

The reason I did not want to publish this - a single voice is invariably dismissed. So, I want to assemble in a central place as many essays like this from students, postdocs, and professors. The funding crisis will not be addressed until the severity of it is acknowledged and NIH, politicians, and scientists are alarmed enough. Please e-mail me your stories to lenny at zappylab dot com (whether new or published elsewhere). I will put together a site aggregating all of them.

以前我不发表这个观点的原因是-势单力孤。所以我希望等到足够多的类似的观点由不同的学生,博士后,甚至教授提出来之后,把它们整合在一起。学术圈的资金危机很难解决,除非NIH本身意识到问题的严重性,而且最好是政客,科学家们也对此有着足够的警醒。所以请把你的经历发邮件给我,不管你以前是否也在其它什么场合抱怨过,我将对他们做一个汇总,统一发表出来。

BioluminessaFebruary 17, 2014 at 10:55 AM

Hi Lenny. Thank you for sharing your perspective! Here is another to add to your collection of essays. I wrote this piece the other day coming to similar conclusions. http://bioluminate.blogspot.com/2014/02/the-seven-stages-of-grief-for-academic.html

谢谢你分享你的观点,这里有个故事可能比较符合你的要求。

sheiselsewhereFebruary 25, 2014 at 11:54 AM

I often get told that I shouldn't be so negative and that things will get better. Unfortunately, I don't have the time to wait. Here is my contribution.

http://sheiselsewhere.mosdave.com/2014/02/16/singing-for-supper/

长久以来,人们就告诉我我不应该如此的悲观,一切都会好起来的。不幸的是,我已经没有足够的时间去等待事情好转了。下面是我的经历。

Dregev21February 28, 2014 at 2:13 PM

Wow, thank you for posting this! I have gone through a very similar situation and have also decided to quit pursuing this dream. I was a 4th year PhD student at the University of Florida (where I had already had to change labs since my first mentor moved to UAB) and my project was going nowhere fast. I also started seeing academia for what it has become; an industry of cheap labor and false hopes. But like you, I stayed in it for as long as I could because of my love for science, learning and teaching. I quit and got out with a MS degree this past November and I am very happy with my decision. I began working as a research coordinator at UF, making more money and like you felt liberated and free from the constant stress of graduate work and research. I believe most students come in to graduate scholl for the same reasons, but it has become so disheartening and scary, that it didn't seem worth it to me anymore. I think it is important for current students to know and understand that there are other things to do in life that are more fruitfull, less stressful and just as intelectually stimulating and rewarding. In any case, thank you for sharing!

哇,谢谢你的分享!我有着同样的经历,也准备放弃一直追寻的梦想了。我是佛罗里达大学的一名在读博士生,这是我博士生涯的第四年,而博士期间,我已经换过一次导师了,因为我的前导师去了UAB,所以我的博士课题也换了。就我的经历来看,我也认为学生圈现在充斥着廉价劳动力和虚假的期望。然而就像你一样,凭着对科学的一腔热血,我坚持了很长时间。直到去年十一月,我辍学了,就拿到了一个硕士学位,并且我非常开心我做了这样的决定。我的第一份工作是在UF做一个研究协调员,可以挣很多钱了,也终于从无止境的毕业研究课题压力中解放出来了,跟你一样的自由了。我觉得是时候让现在的学生知道在我们的生命中还有这非常多的更有意义,会收获更大,但却压力更小的事情可以做。

Travels with Moby March 1, 2014 at 12:58 PM

Good Luck to you. I made this choice for many similar reasons about 6 years ago. But I was a college professor at a small college. You have aptly described the scenario. Even without the added stress of the grant machine, the choices that we are forces to make that divorce ourselves from family, friends to pursue this academic dream are incredibly costly. What I did find after a year in industry, is that I was not alone, I met former academics in industry and elsewhere that have expressed the same concerns. I wish you the best, and you are not alone.

祝你好运!我也做了同样的决定,在六年前。但是我曾经是一个小学院的一名教授。你非常精准的描述了我们这样的教授的状态。即使没有基金方面的压力,我们这样的选择也使得我们无法兼顾家庭,朋友。追求学术的道理牺牲真的好大。当我离开学术圈,进入产业界的第一年,我终于不再觉得孤单了,我遇到以前学术圈的好友的时候也听到过他们有着类似的抱怨。不管怎么样,还是希望你能好好的,毕竟,我们都不孤单。

Kevin ZelnioMarch 3, 2014 at 8:51 AM

Good luck! Life is better outside academia lol. I left 2 phds (got my masters after first one) and a decade long research career with 12 and then a 5+ year science communication career, left the country and started a microbrewery in Sweden. My skills as a scientist have been instrumental in my new profession as a beer maker (serious lab and sanitation skills here!) and a business person (improved and more diverse funding sources! AKA investors and people who drink BEER - which is like everybody). I cried a lot, I won't lie. Almost wrecked my marriage and the stress turned me into a horrible father for a while. Its just not a sustainable career for some types of people. Which is a shame, because the career is selecting for the same type of people and missing out on a diversity of life styles which could most likely benefit the scientific community in a number of ways. Here was my story: http://deepseanews.com/2013/02/19294/

祝你好运!学术圈外面的世界更精彩,O(∩_∩)O哈哈哈~我有着长达12年的科学研究经历和多于5年的科学普及经历,终于离开了我的祖国,来到了瑞典开始酿酒。在我还是科学家的时候掌握的各种技能对我现在作为一个啤酒酿造师还是蛮有帮助的,尤其是严肃认真的实验习惯以及无菌操作技巧。那时候申请各种稀奇古怪的基金所锻炼出来的沟通技巧也特别适合自己现在转行做商人。我曾经为这样的转变哭泣过,我承认。这几乎摧毁了我的婚姻生活,也在一段时间让我变成了一个糟糕的父亲。对某些人来说,这样的职业不是一辈子的事情,而且是某种意义上的丢脸。以下是我的故事。

UnknownMarch 3, 2014 at 2:10 PM

I don't see why I should view your departure as
a bad sign for the life sciences. As an engineer,
we celebrate when our students graduate, go
start a company or join an existing one, and
create products that make the world a better
place. Or, go work at a national laboratory,
the FCC, a non-profit, or any of the other types of
jobs where engineers make a contribution.

我不认为,应该把你的离开看作是科学界的损失。作为工程师,我们欢迎毕业了的同学开公司或者加入已有的公司,去创造一些让人们生活更美好的产品。或者,干脆去国家实验室工作,或者FCC也行,一个非盈利组织,或者可以参加很多其它类似的,对工程师有需求的工作

Lenny TeytelmanMarch 3, 2014 at 2:26 PM

First of all, it is terrific that you are supportive of graduate students who go on to be productive outside of academia! Unfortunately, in life sciences, you often lose support of your mentor the second you say that you do not plan to be a tenure track professor.
Second, and most importantly - the reason our departures and anxieties are cause for concern - being a professor, in the current funding climate, requires a level of sacrifice for science that fewer and fewer of the most talented and brightest scientists will make. Our taxpayers spend an extraordinary amount funding research. If the best scientists leave academia, research will suffer. Training of the future scientists will suffer. Science, inside and outside of academia will suffer.

首先,像你这样的毕业生,走出学术圈之后还能创造如此多的产品,必然是非常了不起的。不幸的是,在生命科学领域,你经常就会失去你导师的资助,就像你博文里面提到的第二点那样,你也不打算去争取一个终身教职了。

其次,最重要的是我们的离开仅仅是因为焦虑,在如今的基金资助条件下,成为一个教授,需要一定程度的牺牲,只有极少数的非常聪明,非常有才智的科学家能做到了。我们的纳税人还是投入了大笔资金在科学研究的,如果大量的优秀科学家都离开了学术圈,那谈何学术成果呢。一个科学家的训练是需要耗费大量时间和金钱的,他们都将因此受损,同时,不管是学术圈里面还是外面都会蒙受巨大的损失。

gregMarch 3, 2014 at 3:07 PM

Your story collection is a great idea. I hope you'll keep the sources of the site open? I bet a lot of people would like to contribute to making that project stand out - I would certainly be helping out.

Thanks a ton for your blog post. Your last point about leaving your wife if she'd treated you as badly as science does is awesome. I'm just coming to the realization that you seem to already have: the notion that "If you can see yourself possibly loving any profession as much as you love science, you're not cut out for science" is unhealthy - it's a mark of the sort of brainwashing that academia does to you.

Best wishes on your future path.

你这个想把所有类似的学术圈失意的故事收集起来的想法很棒。我希望你能持续收集,并且保持开放。我也确信会有非常多的人参与进来贡献他们的故事,让这个计划传播开来。我也会尽我的努力去推动它。

能看到你这个博文真的是三生有幸。你最后那句话(如果你妻子对你像科学对你那样你肯定把她给甩了)简直是太精彩了。我也开始有着你曾经有过的想法了,如果你对追求科学还抱有疑问,是因为你的爱不够坚定,这样的想法是不对的,这只是学术圈对你的洗脑。

Jessica WilsonMarch 6, 2014 at 11:43 AM

This is fantastic writing, despite the sadness. I sympathize (finishing PhD in neuroscience, considering heading out).

I'd love to try and make a video with some of the stories you've accumulated. I'm already looking through that Google Doc you posted right now, and my heart is breaking.

文章写的真棒,尽管让人感到莫名的悲伤和失望。作为一个刚刚结束了神经科学学习课程的博士生,我深有同感。

我很乐意为你收集的这些故事拍一个video,我也看完了你在google doc里面所表达的观点,实在是太震撼了。

CBMarch 12, 2014 at 6:09 PM

Really great stuff. I have reread this post a dozen times over the past couple weeks, as I am a postdoc currently on the precipice of throwing in the towel on my academic career. I find the last sentence particularly meaningful. I can't shake the feeling that giving up on this career that I have been laser-focused on for ten years feels an awful lot like a traumatic breakup. But the simple truth is exactly as you described, academic science simply doesn't respect its professionals nearly enough for the best of us to stick around.

Ugh, breakups suck!

 

Michael RuddyMarch 14, 2014 at 1:55 PM

How appropriate for a Valentine post ... if you do not love everything about what you are doing – move on until you find it!

O(∩_∩)O哈哈~在情人节发表这样的观点实在是太适合不过了。如果你实在是不喜欢你正在做的事情,果断的放弃,持续寻找直到找到你所爱的。

Nick EffordFebruary 15, 2014 at 8:53 AM

I sympathise and wish you a successful and fulfilling future, wherever that takes you. The pressures in UK academia are much the same, as is the relatively low pay. We've seen our pay fall 13% taking into account inflation over the last 5 or 6 years, and universities refuse to offer a decent pay increase despite increasing their income from students and despite the fact that they are sitting on huge cash reserves. My own institution would rather spend £50 million on new buildings than reward its staff for their dedication.

Like you and countless others, I'm reluctant to leave a job that can be very exciting and stimulating. But the truth is that the stress levels make it increasingly unsustainable. There is constant pressure to write papers and secure research funding and simultaneous pressure to improve teaching quality, but there is a failure to recognise that time is a finite resource, so one activity must inevitably be traded off against the other.

I don't expect to receive the same remuneration as I would in industry, but I do need one of two things to happen: either working conditions need to improve or the pay needs to improve to reflect the real pressures of the job. I've sacrificed too many evenings and weekends over the years, and that has had a negative impact on personal physical and mental health as well as family relationships. If something doesn't give soon, I could well end up following you out of academia.

The trade union for academics in the UK is currently locked in a bitter pay dispute with the universities. You can find out more about it at http://fairpay.web.ucu.org.uk/

Good luck!

Nick

我也深有同感,也期望你有个成功而且精彩的未来,不管你走向哪个领域。英国的学术圈压力也很类似,因为相对来说基金资助量都很小。在过去的5到6年间,我们的资金总量,考虑到通货膨胀,反而缩减了13%,大学却拒绝对研究基金做一个比较像样的提高,尽管他们向学生收取的学费更多了,而且他们的现金流也非常健康。我所在的研究所情愿花五千万欧元区修建一个大楼,也不会去奖励那些辛辛苦苦奉献着的教职工。

像你以及无数的其他类似经历的人一样,我也不情愿离开这个即兴奋又刺激的研究工作。但是压力的与日俱增让我的研究工作越来越难以为继。我们既要保证发表足够的文章,还要争取研究基金的支持,同时还要提高我们的教学质量,但是我们不得不承认,一个人的时间是有限的,所以我们必然会在有些方面做得不够。

我不期望能得到与工业界相当的报酬,我仅仅是期望我们的工作条件能得到改善,并且我们的报酬水平应该能对得住我们实际上所承受的工作压力。长年以来,我已经付出了无数个夜晚和周末,而这对我个人的身心健康是一个很大的影响,同时也极大的影响了我的家庭关系。如果这一切不尽快改善的话,我想,我应该马上就会追寻你的脚步,离开这学术圈了。

16

最全面的转录组研究软件收集

能看到这个网站真的是一个意外,现在看来,还是外国人比较认真呀, 这份软件清单,能看出作者的确是花了大力气的,满满的都是诚意。from: https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools
https://en.wiki2.org/wiki/List_of_RNA-Seq_bioinformatics_tools软件主要涵盖了转录组分析的以下18个方向,看我我才明白自己的水平的确没到家,印象中的转录组分析也就是差异表达,然后注释以下,最多分析一下融合基因,要不然就看看那些miRNA,和lncRNA咯,没想到里面的学问也大着呢,怪不得生物是一个大坑,来再多的学者也不怕,咱有的是研究方向给你。

    1 Quality control and pre-processing data
        1.1 Quality control and filtering data
        1.2 Detection of chimeric reads
        1.3 Errors Correction
        1.4 Pre-processing data
    2 Alignment Tools
        2.1 Short (Unspliced) aligners
        2.2 Spliced aligners
            2.2.1 Aligners based on known splice junctions (annotation-guided aligners)
            2.2.2 De novo Splice Aligners
                2.2.2.1 De novo Splice Aligners that also use annotation optionally
                2.2.2.2 Other Spliced Aligners
    3 Normalization, Quantitative analysis and Differential Expression
        3.1 Multi-tool solutions
    4 Workbench (analysis pipeline / integrated solutions)
        4.1 Commercial Solutions
        4.2 Open (free) Source Solutions
    5 Alternative Splicing Analysis
        5.1 General Tools
        5.2 Intron Retention Analysis
    6 Bias Correction
    7 Fusion genes/chimeras/translocation finders/structural variations
    8 Copy Number Variation identification
    9 RNA-Seq simulators
    10 Transcriptome assemblers
        10.1 Genome-Guided assemblers
        10.2 Genome-Independent (de novo) assemblers
            10.2.1 Assembly evaluation tools
    11 Co-expression networks
    12 miRNA prediction
    13 Visualization tools
    14 Functional, Network & Pathway Analysis Tools
    15 Further annotation tools for RNA-Seq data
    16 RNA-Seq Databases
    17 Webinars and Presentations
    18 References

 

16

3000多份水稻全基因组测序数据共享-主要是突变数据

感觉最近接触的生物信息学知识越多,越对大数据时代的到来更有同感了。现在的研究者,其实很多都可以自己在家里做了,大量的数据基本都是公开的, 但是一个人闭门造车成就真的有限,与他人交流的思想碰撞还是蛮重要的。

这里面列出了3000多份水稻全基因组测序数据,都共享在亚马逊云上面,是全基因组的双端测序数据,共3,024个水稻数据,比对到了五种不同的水稻参考基因组上面,而且主要是用GATK来找差异基因的。
而且,数据收集者还给出了一个snp calling的标准流程
我以前也是用这样的流程
SNP Pipeline Commands

1. Index the reference genome using bwa index

   /software/bwa-0.7.10/bwa index /reference/japonica/reference.fa

2. Align the paired reads to reference genome using bwa mem. 
   Note: Specify the number of threads or processes to use using the -t parameter. The possible number of threads depends on the machine where the command will run.

   /software/bwa-0.7.10/bwa mem -M -t 8 /reference/japonica/reference.fa /reads/filename_1.fq.gz /reads/filename_2.fq.gz > /output/filename.sam

3. Sort SAM file and output as BAM file

   java -Xmx8g -jar /software/picard-tools-1.119/SortSam.jar INPUT=/output/filename.sam OUTPUT=/output/filename.sorted.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

4. Fix mate information

   java -Xmx8g -jar /software/picard-tools-1.119/FixMateInformation.jar INPUT=/output/filename.sorted.bam OUTPUT=/output/filename.fxmt.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

5. Mark duplicate reads

   java -Xmx8g -jar /software/picard-tools-1.119/MarkDuplicates.jar INPUT=/output/filename.fxmt.bam OUTPUT=/output/filename.mkdup.bam METRICS_FILE=/output/filename.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

6. Add or replace read groups

   java -Xmx8g -jar /software/picard-tools-1.119/AddOrReplaceReadGroups.jar INPUT=/output/filename.mkdup.bam OUTPUT=/output/filename.addrep.bam RGID=readname PL=Illumina SM=readname CN=BGI VALIDATION_STRINGENCY=LENIENT SO=coordinate CREATE_INDEX=TRUE

7. Create index and dictionary for reference genome

   /software/samtools-1.0/samtools faidx /reference/japonica/reference.fa
   
   java -Xmx8g -jar /software/picard-tools-1.119/CreateSequenceDictionary.jar REFERENCE=/reference/japonica/reference.fa OUTPUT=/reference/reference.dict

8. Realign Target 

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -o /output/filename.intervals -fixMisencodedQuals -nt 8

9. Indel Realigner

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T IndelRealigner -fixMisencodedQuals -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -targetIntervals /output/filename.intervals -o /output/filename.realn.bam 

10. Merge individual BAM files if there are multiple read pairs per sample

   /software/samtools-1.0/samtools merge /output/filename.merged.bam /output/*.realn.bam

11. Call SNPs using Unified Genotyper

   java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /reference/japonica/reference.fa -I /output/filename.merged.bam -o filename.merged.vcf -glm BOTH -mbq 20 --genotyping_mode DISCOVERY -out_mode EMIT_ALL_SITES
16

NGS数据比对工具持续收集

无意中看到了这个网站,比wiki的还有全面和专业。搜集了现有还算比较出名的比对软件,并且列出来了,还做了简单评价,里面对比对工具的收集,主要是基于2012年的一个综述《Tools for mapping high-throughput sequencing data》,相信应该是有不少人都看过这篇综述的,其实生物信息初学者应该自己去文献数据库找点感兴趣的关键词的综述多看看,广泛涉猎总没有坏处的。

<img src="http://www.ebi.ac.uk/~nf/hts_mappers/mappers_timeline.jpeg" alt="Mappers Timeline" width="800">

Features Comparison

The following Table enables a comparison of mappers based on different characteristics. The table can be sorted by column (just click on the column name). The data was collected from different sources and in some cases was provided by the developers. For execution times and memory requirements we refer to the above mentioned review (supplementary data is available here).

The Data column indicates if the mapper is specifically tailored for DNA, RNA, miRNA, or bisulfite sequences.The Seq.Plat. column indicates if the mapper supports natively reads from a specific sequencing platform or not (N). The version column indicates the version of the mapper considered. Read length limits are showed in two columns: minimum read length (Min. RL) and maximum read length (Max. RL.). Unless otherwise stated the unit is base pairs. The support for mismatches and short indels is also presented including, when possible, the maximum number of allowed mismatches and indels: by default the value is presented in bases; in some cases the value is presented as a percentage of the read size; or as score, meaning that mapper uses a score function. The alignments reported column indicate the alignments reported when a read maps to multiple locations. The alignment column indicates if the reads are aligned end-to-end (Globally) or not (Locally). The Parallel column indicates if the mapper can be run in parallel and, if yes, how: using a shared-memory (SM) or/and a distributed memory (DM) computer. The QA (quality awareness) column indicates if the mapper uses read quality information during the mapping. The support for paired-end/mate-pair reads is indicated in the PE column. The Splicing column indicates, for the RNA mappers, if the detection of splice junctions is made de novo or/and through user provided libraries (Lib). The Index column indicates if the reads or/and the reference are indexed. The number of citations was obtained from Google Scholar on 13 June 2015.
16

nature发表的统计学专题Statistics in biology

生物学里面,唯一还算有点技术含量,和有点门槛,就是生物统计了,而这也是绝大部分研究者的痛点,有能力的,可以看看nature上面关于统计学的专题讨论,而且主要是应用于自然科学的统计学讨论。

里面有几句统计学名言警句:
Statistics does not tell us whether we are right. It tells us the chances of being wrong.
统计学并不会告诉我们是否正确,而只是说明我们错误的可能性是多少。
Quality is often more important than quantity.
数据的质量远比数量要重要的多
The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap.
Good experimental designs mitigate experimental error and the impact of factors not under study.
文章列表:
Research methods: Know when your numbers are significant
Scientific method: Statistical errors
Weak statistical standards implicated in scientific irreproducibility
The fickle P value generates irreproducible results
Vital statistics
Experimental biology: Sometimes Bayesian statistics are better
A call for transparent reporting to optimize the predictive value of preclinical research
Power failure: why small sample size undermines the reliability of neuroscience
Basic statistical analysis in genetic case-control studies
Erroneous analyses of interactions in neuroscience: a problem of significance
Analyzing 'omics data using hierarchical models
Advantages and pitfalls in the application of mixed-model association methods
Quality control and conduct of genome-wide association meta-analyses
Circular analysis in systems neuroscience: the dangers of double dipping
A solution to dependency: using multilevel analysis to accommodate nested data
How does multiple testing correction work?
What is Bayesian statistics?
What is a hidden Markov model?
下面的这些文章,其实就是我们正常课本里面统计学的知识点,但是放在nature杂志发表,就顿时高大上了好多
Points of significance: Importance of being uncertain
Points of Significance: Error bars
Points of significance: Significance, P values and t-tests
Points of significance: Power and sample size
Points of Significance: Visualizing samples with box plots
Points of significance: Comparing samples €”part I
Points of significance: Comparing samples part II
Points of significance:  Nonparametric tests
Points of significance: Designing comparative experiments
Points of significance: Analysis of variance and blocking
Points of Significance:  Replication
Points of Significance:  Nested designs
Points of Significance: Two-factor designs
Points of significance: Sources of variation
Points of Significance: Split plot design
Points of Significance: Bayes' theorem
Points of significance: Bayesian statistics
Points of Significance: Sampling distributions and the bootstrap
Points of Significance: Bayesian networks
A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.