拿到比对后的sam/bam文件之后,这只能算是level2的数据,一般我们给他人share我们的结果也是直接给表达矩阵的, miRNA分析跟mRNA分析类似,但是它的表达矩阵更好获取一点。如果是mRNA,我们一般会跟基因组来比较,而基因组就那24条参考染色体,想知道具体比对到了哪个基因,需要根据基因组注释文件来写程序提取表达量信息,现在比较流行的是htseq这个软件,我前面也写过教程如何安装和使用,这里就不啰嗦了。但是对于miRNA,因为我比对的就是那1881条前体miRNA序列,所以直接分析比对的sam/bam文件就可以知道每条参考miRNA序列的表达量了。
## step6: counts the reads which mapping to each miRNA reference.## we need to exclude unmapped as well as multiple-mapped reads## XS:i:<n> Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode## NM:i:1 ## NM i Edit distance to the reference, including ambiguous bases but excluding clipping#The following command exclude unmapped (-F 4) as well as multiple-mapped (grep -v “XS:”) reads#samtools view -F 4 input.bam | grep -v "XS:" | wc -l## 180466//1520320##cat >count.hairpin.shls *hairpin.sam | while read iddosamtools view -SF 4 $id |perl -alne '{$h{$F[2]}++}END{print "$_\t$h{$_}" foreach sort keys %h }' > ${id%%_*}.hairpin.countsdone## bash count.hairpin.sh##cat >count.mature.shls *mature.sam | while read iddosamtools view -SF 4 $id |perl -alne '{$h{$F[2]}++}END{print "$_\t$h{$_}" foreach sort keys %h }' > ${id%%_*}.mature.countsdone## bash count.mature.sh
上面的代码,是我自己写的脚本来算表达量,非常简单,因为我没有考虑细节,直接想得到各个样本测序数据的表达量而已。如果是比对到了参考基因组,就要根据miRNA的gff注释文件用htseq等软件来计算表达量啦。
得到了表达量,就可以跟文献来做比较啦:
### step7: compare the results with paper'sGSM1470353: control-CM, experiment1; Homo sapiens; miRNA-Seq SRR1542714GSM1470354: ET1-CM, experiment1; Homo sapiens; miRNA-Seq SRR1542715GSM1470355: control-CM, experiment2; Homo sapiens; miRNA-SeqSRR1542716GSM1470356: ET1-CM, experiment2; Homo sapiens; miRNA-Seq SRR1542717GSM1470357: control-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542718GSM1470358: ET1-CM, experiment3; Homo sapiens; miRNA-Seq SRR1542719### 下面我用R语言来检验一下,我得到的分析结果跟文章发表的结果的区别。a=read.table("bowtie_bam/SRR1542714.mature.counts")b=read.table("paper_results/GSM1470353_iPS_010313_Unstim_known_miRNA_counts.txt")plot(log(tmp[,2]),log(tmp[,3]))cor(tmp[,2],tmp[,3])##[1] 0.8413439
相关性还不错,总算没有分析错咯。
这个代码是我自己根据文章的理解写出的,因为我本身不擅长miRNA数据分析,所以在进行alignment的时候参数选择可能并不是那么友好,如果有高手能指正就最好了,可以直接打我电话告诉我,或者发邮箱给我,邮箱用户名是jmzeng1314,是163邮箱。