找个motif嘛，简单 | 生信菜鸟团

我在生信菜鸟团发布的自学CHIP-seq分析第八讲就提到过如何寻找motif，motif是比较有特征的短序列，会多次出现的，一般认为它的生物学意义重大，做完CHIP-seq分析之后，一般都会寻找motif 。查找有两种：

一是de novo的，要求的输入文件的fasta序列，一般是根据peak的区域的坐标提取好序列
二是依赖于数据库的搜寻匹配，很多课题组会将现有的ChIP-seq数据进行整合，提供更全面，更准确的motif数据库。

motif的英文定义如下：

motif: recurring pattern. eg, sequence motif, structure motif or network motif

DNA sequence motif: short, recurring patterns in DNA that are presumed to have a biological function.

从上边的定义可以看出，其实motif这个单词就是形容一种反复出现的模式，而序列motif往往是DNA上的反复出现的模式，并被假设拥有生物学功能。而且，经常是一些具有序列特异性的蛋白的结合位点（如，转录因子）或者是涉及到重要生物过程的（如，RNA 起始，RNA 终止， RNA 剪切等等）。

一篇文献列出了2014年以前的近乎所有知名的A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data 链接见：https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-9-4

当然，扯这么多大道理，没用。

直接上实战！

说大事专用分割线~

input是找到的peaks文件，bed格式

上游分析这里略过，我的GitHub里面给了全套流程代码：https://github.com/jmzeng1314/NGS-pipeline/tree/master/CHIPseq

反正得到了bed格式的peaks文件，而且一般下载公共数据在GEO里面作者都会顺带上传他们做好的peaks文件，比如：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66581 就可以下载ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE66nnn/GSE66581/suppl/GSE66581_bed_files.tar.gz 文件。

因为是下载作者的数据，所以必须注意；参考基因组版本是：Genome_build: mm9

最常用的是 MEME工具套件：

http://meme-suite.org/ 输入文件是fasta序列，需要对peaks进行转换，根据bed的基因坐标从基因组里面提取对应的序列咯： http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

它里面集成了4个寻找motif 的工具，每个工具都是一篇文章，里面有详细描述具体原理，但是整个网页给人的感觉是too busy，让初学者无从下手，这次讲解就略过它咯。

homer软件来寻找motif

这个软件安装当初特别麻烦： https://github.com/jmzeng1314/NGS-pipeline/blob/master/CHIPseq/step8-Homer-findMotif.sh

但是现在有了conda，一句话搞定：conda install -c bioconda homer ,值得提醒的是，如果是在中国大陆，那么需要设置一下：

# https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes
conda config --show

下载软件及数据

conda install -y -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda homer
perl ~/miniconda2/share/homer-4.9.1-5/configureHomer.pl -install mm9 
ls -lh ~/miniconda2/share/homer-4.9.1-5/data/
## 作者使用的是mm9找到的peaks文件
## 数据库下载取决于网速咯
## 下载成功后会多出 ~/miniconda2/share/homer-4.9.1-5/data/genomes/mm9/ 文件夹, 共 4.8G
## 这个文件夹取决于你把homer这个软件安装到了什么地方。

## 或者用下面代码安装:

cd ~/biosoft
mkdir homer && cd homer
wget http://homer.salk.edu/homer/configureHomer.pl 
perl configureHomer.pl -install
perl configureHomer.pl -install hg19

## 软件安装之后就下载peaks文件，肯定是bed格式的
mkdir ~/motif
cd ~/motif
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE66nnn/GSE66581/suppl/GSE66581_bed_files.tar.gz 
tar zxvf GSE66581_bed_files.tar.gz

下载得到的bed格式的peaks文件如下：

motif/
|-- [3.4M] 1k_es_peaks.bed
|-- [2.5M] 200_es_peaks.bed
|-- [780K] 2cell_early_peaks.bed
|-- [415K] 2cell_early_trx_inhibitor_peaks.bed
|-- [2.3M] 2cell_peaks.bed
|-- [1.7M] 4cell_dba_peaks.bed
|-- [2.2M] 4cell_peaks.bed
|-- [2.1M] 50k_es_peaks.bed
|-- [1.9M] 8cell_peaks.bed
|-- [5.9M] GSE66581_bed_files.tar.gz
`-- [1.4M] icm_peaks.bed

homer软件找motif整合了两个方法，包括依赖于数据库的查询，和de novo的推断。

运行homer软件

但是使用起来很简单：http://homer.ucsd.edu/homer/ngs/peakMotifs.html

for id in *_peaks.bed;
do
echo $id
awk '{print $4"\t"$1"\t"$2"\t"$3"\t+"}' $id >homer_peaks.tmp 
findMotifsGenome.pl homer_peaks.tmp mm9 ${id%%.*}_motifDir -len 8,10,12
annotatePeaks.pl homer_peaks.tmp mm9 1>${id%%.*}.peakAnn.xls 2>${id%%.*}.annLog.txt 
done

不仅仅找了motif，还顺便把peaks注释了一下。

整个流程我是我的云服务器操作的，其中homer软件+mm9数据库约5G，下载的peaks及注释后的文件不到200M，软件安装及数据库下载，数据下载，注释，找motif的代码都给了，如果你想实践一下，可以考虑购买我的云服务器， 99 元一个月，不仅仅是找motif，更多详情请访问：古有杨志卖刀，今有jimmy售器

其它资源

还有一些R包可以，直接从BED文件里面记录的基因坐标来找motif，有的需要输入fasta序列，就需要自己根据bed的基因坐标从基因组里面提取对应的序列咯：

rGADEM (motif discovery): http://bioconductor.org/packages/devel/bioc/html/rGADEM.html

MotIV (motif validation): http://bioconductor.org/packages/devel/bioc/html/MotIV.html

http://lgsun.grc.nia.nih.gov/CisFinder/

http://bioinfo.cs.technion.ac.il/drim/

一	二	三	四	五	六	日
« 九
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30