首先下载两条染色体序列
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrX.fa.gz;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrY.fa.gz;
152M Mar 21 2009 chrX.fa
58M Mar 21 2009 chrY.fa
然后把X染色体构建bwa的索引
bwa index chrX.fa
[bwa_index] Pack FASTA... 1.97 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=310541120, availableWord=33850812
[BWTIncConstructFromPacked] 10 iterations done. 55838672 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 103157920 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 145211344 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 182584528 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 215797872 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 245313968 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 271543920 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 294853104 characters processed.
[bwt_gen] Finished constructing BWT in 88 iterations.
[bwa_index] 98.58 seconds elapse.
[bwa_index] Update BWT... 0.96 sec
[bwa_index] Pack forward-only FASTA... 0.91 sec
[bwa_index] Construct SA from BWT and Occ... 33.18 sec
[main] Version: 0.7.8-r455
[main] CMD: /lrlhps/apps/bioinfo/bwa/bwa-0.7.8/bwa index chrX.fa
[main] Real time: 141.623 sec; CPU: 135.605 sec
由于X染色体也就152M,所以很快,两分钟解决战斗!
然后模拟Y染色体的测序判断(PE100,insert400)
209M Oct 28 10:19 read1.fa
209M Oct 28 10:19 read2.fa
模拟的程序很简单
while(<>){
chomp;
$chrY.=uc $_;
}
$j=0;
open FH_L,">read1.fa";
open FH_R,">read2.fa";
foreach (1..4){
for ($i=600;$i<(length($chrY)-600);$i = $i+50+int(rand(10))){
$up = substr($chrY,$i,100);
$down=substr($chrY,$i+400,100);
next unless $up=~/[ATCG]/;
next unless $down=~/[ATCG]/;
$down=reverse $down;
$down=~tr/ATCG/TAGC/;
$j++;
print FH_L ">read_$j/1\n";
print FH_L "$up\n";
print FH_R ">read_$j/2\n";
print FH_R "$down\n";
}
}
close FH_L;
close FH_R;
然后用bwa mem 来比对
bwa mem -t 12 -M chrX.fa read*.fa >read.sam
用了12个线层,所以也非常快
[main] Version: 0.7.8-r455
[main] CMD: /apps/bioinfo/bwa/bwa-0.7.8/bwa mem -t 12 -M chrX.fa read1.fa read2.fa
[main] Real time: 136.641 sec; CPU: 1525.360 sec
643M Oct 28 10:24 read.sam
然后统计比对结果
samtools view -bS read.sam >read.bam
158M Oct 28 10:26 read.bam
samtools flagstat read.bam
3801483 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
2153410 + 0 mapped (56.65%:-nan%)
3801483 + 0 paired in sequencing
1900666 + 0 read1
1900817 + 0 read2
645876 + 0 properly paired (16.99%:-nan%)
1780930 + 0 with itself and mate mapped
372480 + 0 singletons (9.80%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
我自己看sam文件也发现真的同源性好高呀,总共就模拟了380万reads,就有120万是百分百比对上了。
所以对女性个体来说,测序判断比对到Y染色体是再正常不过的了。如果要判断性别,必须要找那些X,Y差异性区段