
How a bioinformatics beginner can teach themselves to program

I originally came across this as a question on Zhihu and took some time to answer it: http://www.zhihu.com/question/36701137/answer/68928111

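First of all, the fact that you want to read source code is a good sign. Well-written source code really is a shortcut to becoming a better programmer. After all, most of us will never get hands-on guidance from someone else, so we have to rely on our own ability to figure things out.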

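I'll answer from my own experience. I started out with a pure biology background and no programming at all, and my programming skills are now reasonably good!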

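First, whichever language you pick (Perl, Python, R, or MATLAB), there is a pile of introductory books for it. Skim one or two of them from cover to cover without worrying about absorbing every detail (no book is better or worse than another, so don't ask me to recommend titles). You do have to finish them, so that you understand the basics of programming.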

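The next step is the most important one: practice, and keep practicing. Applying programming to real tasks is the fastest way to learn; otherwise, no matter how many books you read, it all stays abstract.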

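Here I strongly recommend one tool collection that implements many of the routine operations bioinformatics needs. The site is: Bioinformatics Tools
It contains the following 64 tools, and the web page describes clearly what each one does. Each is actually very simple, but writing programs to reproduce them is a very effective way to learn.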
"Combines multiple FASTA entries into a single sequence."
"Returns the entire sequence contained in an EMBL file in FASTA format."
"Parses the feature table of an EMBL file and returns the feature sequences."
"Parses the feature table of an EMBL file and returns the protein translations."
"Removes non-DNA characters from text."
"Removes non-protein characters from text."
"Returns the entire sequence contained in a GenBank file in FASTA format."
"Parses the feature table of a GenBank file and returns the feature sequences."
"Parses the feature table of a GenBank file and returns the protein translations."
"Converts single letter amino acid codes to three letter codes."
"Reads a list of positions and ranges and returns those parts of a DNA sequence."
"Reads a list of positions and ranges and returns those parts of a protein sequence."
"Determines the reverse-complement, reverse, or complement of the sequence you enter."
"Separates bases according to codon position."
"Converts a FASTA sequence into multiple sequences."
"Converts three letter amino acid codes to one letter codes."
"Returns DNA sequence segments specified by a position and window size."
"Returns protein sequence segments specified by a position and window size."
"Plots codon frequency (according to the codon table you enter) for each codon in a DNA sequence."
"Returns a standard codon usage table."
"Returns a list of potential CpG islands."
"Calculates the molecular weight of DNA sequences."
"Returns positions of the patterns you enter."
"Returns basic sequence statistics."
"Returns sequences that are identical or similar to a query sequence."
"Returns sequences that are identical or similar to a query sequence."
"Accepts aligned sequences in FASTA format and calculates the identity and similarity of each sequence pair."
"Can be used to predict a DNA sequence in another species using a protein sequence alignment."
"Finds DNA sequences that can easily be converted to a restriction site."
"Determines the positions of open reading frames."
"Returns the optimal global alignment for two coding DNA sequences."
"Returns the optimal global alignment for two DNA sequences."
"Returns the optimal global alignment for two protein sequences."
"Returns a report describing PCR primer properties"
"Generates PCR products from a template and two primer sequences."
"Returns the grand average of hydropathy value of protein sequences."
"Returns the predicted isoelectric point of protein sequences."
"Calculates the molecular weight of protein sequences."
"Returns positions of the patterns you enter."
"Returns basic sequence statistics."
"Converts the sequence you enter into restriction fragments."
"Returns the number and positions of restriction sites."
"Can be used to convert protein into DNA."
"Returns the translation in the reading frame you specify."
"Colors a sequence alignment based on sequence conservation."
"Colors a protein alignment based on biochemical properties of residues."
"Numbers and groups DNA according to your specifications."
"Numbers and groups amino acids according to your specifications."
"Shows PCR primer annealing sites, translations, and restriction sites."
"Shows restriction sites and protein translations."
"Shows protein translations."
"Introduces random mutations into DNA sequences."
"Introduces random mutations into protein sequences."
"Generates a random coding sequence of the length you specify."
"Generates a random DNA sequence of the length you specify."
"Replaces regions of the DNA sequences you enter with random bases."
"Generates a random protein sequence of the length you specify."
"Replaces regions of the protein sequences you enter with random residues."
"Samples bases from a DNA sequence with replacement."
"Samples residues from a protein sequence with replacement."
"Randomly shuffles the DNA sequences you enter."
"Randomly shuffles the protein sequences you enter."
"IUPAC codes for DNA and protein."
"The genetic codes used in the Sequence Manipulation Suite."
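Once you have implemented all of these, you will not only have learned to program, you will have learned how programming is applied in bioinformatics!
Any of Perl, Python, R, or MATLAB can do it. For these tasks they are interchangeable, so don't agonize over which language to choose.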
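To give a sense of scale, here is a minimal sketch of three tasks from the list above (removing non-DNA characters, reverse-complementing, and combining FASTA entries), written in Python, though any of the languages mentioned would do. This is not the code behind the site. The function names are hypothetical, and the filter assumes that only A, C, G, T, and N count as DNA characters.

```python
import re

# Assumption: only A, C, G, T and N are treated as valid DNA letters.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def clean_dna(text):
    """Upper-case the text and drop everything that is not A, C, G, T or N."""
    return re.sub(r"[^ACGTN]", "", text.upper())


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])  # start a new record
        elif records:
            records[-1][1] += line          # extend the current sequence
    return [(header, seq) for header, seq in records]


def combine_fasta(text):
    """Join the sequences of all FASTA entries into one sequence."""
    return "".join(seq for _, seq in read_fasta(text))
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`, and `combine_fasta(">a\nACG\nT\n>b\nGG\n")` gives `"ACGTGG"`.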
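I don't recommend that beginners read source code, because production source code is too formal: variable definitions alone run to dozens of lines, and function definitions to hundreds more. People who actually work in bioinformatics rarely write scripts longer than fifty lines. The 64 data-processing tasks I listed above, for example, usually take only seven or eight lines each (in Perl).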
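If you don't believe me, look at the code hosted in this GitHub repository: trinityrnaseq/util/misc at master · trinityrnaseq/trinityrnaseq · GitHub
It contains a lot of Perl scripts for all kinds of data conversion. They are written very formally, to the point of turning a problem that one line could solve into hundreds or even thousands of lines. Unless you plan to publish or sell your code, ordinary bioinformatics research has no need for that!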
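As a rough illustration of how short such a script can be, here is a sketch of one of the 64 tasks, translating DNA in a chosen reading frame, written in Python rather than Perl. It assumes the standard genetic code and handles only the three forward frames. The names `CODON_TABLE` and `translate` are hypothetical and not taken from any of the tools mentioned.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}


def translate(dna, frame=1):
    """Translate dna in forward frame 1, 2 or 3; unknown codons become 'X'."""
    seq = dna.upper()[frame - 1:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Calling `translate("ATGGCCTAA")` returns `"MA*"`, where `*` marks a stop codon. Trailing bases that do not fill a complete codon are ignored.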
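Of course, back to your original question: where can you find source code?
First, you can go to the library and look through a stack of books. They usually come with a CD that includes both videos and source code, or the book will tell you where to download the source, for example this one: pleac/include/perl at master · pleac/pleac · GitHub
Second, you can look at the many bioinformatics software packages, which are generally hosted on GitHub. This link lists more than three hundred tools for transcriptome analysis: List of RNA-Seq bioinformatics tools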
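This link lists several hundred alignment tools used in bioinformatics: NGS数据比对工具持续收集 (an ongoing collection of NGS alignment tools)
(Remember, these tools were all published in papers and are very difficult. Fully understanding even one of them in a lifetime would be an achievement. I, for example, only dug into bowtie, and even that I only half understand.)
Even the common bioinformatics databases publish their own source packages, for example NCBI, Ensembl, and UCSC.
Here is Ensembl's: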
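Ensembl shares all of its code, which is really convenient: Ensembl Project · GitHub
You can learn to program by following this code: Ensembl/ensembl-pipeline · GitHub
The help documentation on the Ensembl website is also very detailed: Help & Documentation
Are you still short of material now?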