Any recommendations for English-language bioinformatics communities?

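As I mentioned earlier in "我教程的第一个外国读者" (the first foreign reader of my tutorials), quite a few overseas readers have started to follow my tutorials. Admittedly, I have written more than ten thousand tutorials over the years, and many of them could and should be shared abroad, but out of native-language inertia I do not want to spend time writing English versions. Fortunately, I have many apprentices, so I recruited some of the best to form a small translation team. They will translate the more widely read tutorials from the 《生信技能树》 official account and publish them on English-language bioinformatics communities.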

Please recommend any English-language bioinformatics communities you have found worthwhile. We will register accounts there and start sharing!

Next, here is the first batch of translations, from a translator who writes under the pen name Biolily:

The first tutorial translated by an apprentice

Original post on the 《生信技能树》 official account: 根据坐标在基因组上面拿到碱基序列来设计引物

The translated title is: Get the base sequence according to the coordinates on the genome to design primers

Generally speaking, DNA sequencing gives you the locations of mutation sites. Whether a site is an SNV or an INDEL, it is just a coordinate on the genome. High-throughput sequencing results usually need experimental verification, mostly by Sanger sequencing, so you need to design primers that capture the sequence around each mutation site. High-throughput results that have been verified by Sanger sequencing will not be doubted by any reviewer!

For one or two sites, we can easily look up the sequence with various web tools. However, high-throughput sequencing often reports tens of thousands of sites. Even if you want to save costs, you generally need to pick about 100 sites and design primers for Sanger sequencing. Querying each site on a web page is backbreaking work, and this is where code can run the queries in batch.

First, we use R to simulate 22 mutation sites:

The code is simple. Here we randomly select one site on each of the 22 autosomes as a demonstration:

> pos=data.frame(chr=paste0('chr',1:22),start=sample(1:10000000,22))
> pos
 chr start
1 chr1 2022626
2 chr2 696733
3 chr3 3250387
4 chr4 7673854
5 chr5 5408537
6 chr6 9719502
7 chr7 6581990
8 chr8 9601594
9 chr9 4787975
10 chr10 3528978
11 chr11 5885445
12 chr12 4356111
13 chr13 9586571
14 chr14 5893113
15 chr15 2299890
16 chr16 5854945
17 chr17 3117896
18 chr18 1789465
19 chr19 7853784
20 chr20 6409488
21 chr21 3040456
22 chr22 8896738

Then use the BSgenome::getSeq function to extract the sequence at each site

The reference genome sequence comes from the BSgenome.Hsapiens.UCSC.hg38 package, which is huge! Remember to switch to a fast mirror before you download and install it!

options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))
options(download.file.method ='libcurl')
options(url.method='libcurl')
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38",ask = F,update = F)

Once the package is installed, you can use the BSgenome::getSeq function directly. The complete code is as follows:

pos=data.frame(chr=paste0('chr',1:22),start=sample(1:10000000,22))
pos
library("BSgenome.Hsapiens.UCSC.hg38")
library("GenomicRanges")
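# In practice, you would read your own mutation coordinate file instead:
# pos=read.table('pos.txt')
# head(pos)
# 400 bp on each side of the mutation site for primer design
# (note: pos1 ends at the site itself, so the upstream window includes the
#  mutated base; pos1 and pos3 therefore each span 401 bp)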
pos1=GRanges(seqnames=pos[,1], ranges=IRanges(start=pos[,2]-400,end=pos[,2]))
pos2=GRanges(seqnames=pos[,1], ranges=IRanges(start=pos[,2],end=pos[,2]))
pos3=GRanges(seqnames=pos[,1], ranges=IRanges(start=pos[,2]+1,end=pos[,2]+401))
seq1 = BSgenome::getSeq(BSgenome.Hsapiens.UCSC.hg38, pos1)
seq2 = BSgenome::getSeq(BSgenome.Hsapiens.UCSC.hg38, pos2)
seq3 = BSgenome::getSeq(BSgenome.Hsapiens.UCSC.hg38, pos3)
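
As a cross-check on the coordinate arithmetic, the flanking windows can also be computed outside R. The following is a minimal shell sketch and not part of the original tutorial. It assumes a hypothetical tab-separated file pos.txt with a chromosome name and a 1-based position per line, and it writes the upstream window, the site, and the downstream window to a hypothetical windows.txt. Unlike the GRanges calls above, whose upstream range ends at the site itself, this version stops the upstream window one base before the site so that the three pieces do not overlap.

```shell
# Hypothetical input: "chromosome<TAB>position" per line, 1-based coordinates.
printf 'chr1\t2022626\nchr2\t696733\n' > pos.txt

# Number of bases wanted on each side of the site.
flank=400

# Output columns: chr, upstream start, upstream end, site, downstream start, downstream end.
awk -v f="$flank" 'BEGIN { OFS = "\t" }
{
    up_start = $2 - f
    if (up_start < 1) up_start = 1      # clamp at the start of the chromosome
    print $1, up_start, $2 - 1, $2, $2 + 1, $2 + f
}' pos.txt > windows.txt

cat windows.txt
```

The resulting table could then be read back into R with read.table and turned into GRanges objects in the same way as above.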

The output can be written as FASTA or plain-text files. Plain text is the usual choice, because most biologists without a bioinformatics background are not familiar with FASTA.

# To write FASTA instead (shown for seq1):
# names(seq1) = paste0("SEQUENCE_", seq_along(seq1))
# Biostrings::writeXStringSet(seq1, "my.fasta")

tmp=cbind(as.data.frame(pos),
 as.data.frame(as.character(seq1)),
 as.data.frame(as.character(seq2)),
 as.data.frame(as.character(seq3)))
write.table(tmp,file ='myFastq.txt',
 row.names = F,quote = F,col.names = F)

The 400 bp flanks extracted by the code above are too long to display in a tutorial, so I changed the window to 4 bp on each side. The result looks like this:

chr1 2022626 CCTCA A CTAG
chr2 696733 TCCCT T AGGT
chr3 3250387 CTACT T ACAC
chr4 7673854 CCACC C ACCC
chr5 5408537 GTAAA A ACTA
chr6 9719502 ATATT T AATT
chr7 6581990 TGGTT T GGCC
chr8 9601594 AACAC C CTGA
chr9 4787975 AAAGC C AAAC
chr10 3528978 TCATA A TCAC
chr11 5885445 AGATT T AATG
chr12 4356111 GTGGA A GAGC
chr13 9586571 NNNNN N NNNN
chr14 5893113 NNNNN N NNNN
chr15 2299890 NNNNN N NNNN
chr16 5854945 ATTGT T GGTT
chr17 3117896 TCAAA A CCCC
chr18 1789465 TTCTT T TACA
chr19 7853784 GGGAC C CGCC
chr20 6409488 CCAGG G GCTT
chr21 3040456 NNNNN N NNNN
chr22 8896738 NNNNN N NNNN

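You will find that the bases upstream and downstream of each mutation site have been extracted. Rows that consist only of N, such as chr13, fall in unassembled gap regions of the reference genome, which is expected for random positions near the start of the acrocentric chromosomes. You can design primers based on these sequences for Sanger sequencing verification.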
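
Windows that contain N cannot serve as primer templates, so it may help to drop them before designing primers. A minimal shell sketch, assuming the five-column layout shown above (chromosome, position, upstream bases, site base, downstream bases). The file names demo.txt and usable.txt are hypothetical, and the three rows are copied from the demonstration output:

```shell
# Three rows copied from the demonstration output above.
cat > demo.txt <<'EOF'
chr12 4356111 GTGGA A GAGC
chr13 9586571 NNNNN N NNNN
chr16 5854945 ATTGT T GGTT
EOF

# Keep only rows whose upstream and downstream windows are free of N.
awk '$3 !~ /N/ && $5 !~ /N/' demo.txt > usable.txt

cat usable.txt
```

In a real project, the discarded sites would need to be replaced by other candidate sites rather than silently dropped.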

Bioconductor is well worth learning!

The second tutorial translated by an apprentice

Original post on the 《生信技能树》 official account: 在Linux服务器里面安装GISTIC软件

The translated title is: Install GISTIC software on a Linux server

About five or six years ago, I wrote a tutorial on installing and using GISTIC, aimed at copy number arrays such as SNP6.0. GISTIC is used frequently in the TCGA project. Its purpose is simple: you have studied many cancer samples and, through microarrays or tumor exome sequencing, obtained the copy number changes of each sample, usually as segment results that can be interpreted as CNV regions. You then use GISTIC to analyze the samples together, search for somatic CNVs, and annotate them with gene information.

There are two difficulties in using GISTIC. One is installing the MATLAB runtime environment on Linux, and the other is preparing the input files.

Download the offline installation package from the official website
The official website: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/

Download this file (GISTIC_2_0_23.tar.gz):

GISTIC_2_0_23.tar.gz 596 MB

Download and extract it with the following commands:

mkdir -p $HOME/biosoft/GISTIC
cd $HOME/biosoft/GISTIC
wget ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTIC_2_0_23.tar.gz
tar zxvf GISTIC_2_0_23.tar.gz

After all is done, the folder structure is as follows:

Install MCRInstaller

GISTIC is a MATLAB program, so running it on Linux requires the MATLAB runtime, which is installed with MCRInstaller. MATLAB itself is commercial software with a graphical interface. Although most people in bioinformatics use R and Linux rather than MATLAB, many top institutions, such as the famous Broad Institute, still use MATLAB, so the programs they develop are also released as MATLAB code. Since most researchers do not have MATLAB or do not know how to use it, we install the MATLAB runtime environment on the Linux system instead. With it, we can run MATLAB programs written by others as scripts on the Linux command line.

There is an MCRInstaller folder in the GISTIC offline package that we downloaded earlier, and it contains an MCRInstaller archive that can be unzipped and installed.

cd MCRInstaller
unzip MCRInstaller.zip
chmod 744 installerinput.txt

Because this is a Linux server, the software is installed in silent mode, with no interactive, mouse-driven installer. You need to make sure a Java environment is available, and you should read through the installerinput.txt file beforehand. The relevant settings are:

destinationFolder=$HOME/biosoft/GISTIC/MATLABCompilerRuntime
agreeToLicense=yes
mode=silent

The same settings are passed on the command line when running the installer:

conda activate qc # My java is in this conda environment
./install -mode silent -agreeToLicense yes -destinationFolder $HOME/biosoft/GISTIC/MATLABCompilerRuntime

This step requires a good understanding of the installerinput.txt file, which is actually quite difficult.

The installer prints a short log while it runs, so keep an eye on it. The following lines at the end indicate a successful installation:

(Oct 08, 2020 16:29:36) Exiting with status 0
(Oct 08, 2020 16:29:36) End-Successful.
Finished
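
For unattended installs it can be handy to test for the success message automatically. A minimal sketch, assuming the installer output has been saved to a file (for example by appending `2>&1 | tee install.log` to the install command). The names check_mcr_log and install.log are hypothetical, the sample log lines are the ones shown above, and the pattern tolerates optional spaces around the hyphen in case the real log is spaced differently:

```shell
# Hypothetical helper: succeeds only if the log contains the success marker.
check_mcr_log() {
    grep -Eq 'End ?- ?Successful' "$1"
}

# Sample log copied from the installer output shown above.
cat > install.log <<'EOF'
(Oct 08, 2020 16:29:36) Exiting with status 0
(Oct 08, 2020 16:29:36) End-Successful.
Finished
EOF

if check_mcr_log install.log; then
    echo "MCR installation looks successful"
else
    echo "success marker not found; inspect install.log" >&2
fi
```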

One-stop use of GISTIC software

First, modify the gistic2 launcher script:

$ cat gistic2
#!/bin/sh
## set MCR environment and launch GISTIC executable

## NOTE: change the line below if you have installed the Matlab MCR in an alternative location
MCRROOT=$HOME/biosoft/GISTIC/MATLABCompilerRuntime
MCRVER=v83

echo Setting Matlab MCR root to $MCRROOT

## set up environment variables
LD_LIBRARY_PATH=$MCRROOT/$MCRVER/runtime/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCRROOT/$MCRVER/bin/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCRROOT/$MCRVER/sys/os/glnxa64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
XAPPLRESDIR=$MCRROOT/$MCRVER/MATLABComponentRuntime/v83/X11/app-defaults
export XAPPLRESDIR

## launch GISTIC executable
$HOME/biosoft/GISTIC/gp_gistic2_from_seg $@

Only two things need to change: the MCRROOT variable, and the last line, which now calls the GISTIC executable by its full path.

Every subsequent run is super simple:

conda activate qc
$HOME/biosoft/GISTIC/gistic2 -h

Each project will generate a segment result file. The complete run command is:

basedir=`pwd`/gistic2results
mkdir -p $basedir
echo --- running GISTIC ---
segfile=`pwd`/cnvkitfinalcall.seg
refgenefile=$HOME/biosoft/GISTIC/refgenefiles/hg38.UCSC.add_miR.160920.refgene.mat
$HOME/biosoft/GISTIC/gistic2 -b $basedir -seg $segfile -refgene $refgenefile \
-genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme

Most of these parameters are left at their default values, and the command is copied directly from the sample code.
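
Since preparing the input is one of the two difficulties mentioned earlier, a quick sanity check of the segment file before launching GISTIC may save a failed run. GISTIC's segmentation input is a tab-delimited table with six columns: sample, chromosome, start position, end position, number of markers, and segment copy-number value. The sketch below is an illustration under those assumptions and not part of GISTIC: check_seg, demo.seg, and the two example segments are made up, and a first line whose third column is not a number is treated as an optional header.

```shell
# Hypothetical helper: exits non-zero if any data row is malformed.
check_seg() {
    awk -F'\t' '
        NR == 1 && $3 !~ /^[0-9]+$/ { next }   # skip an optional header line
        NF != 6 { printf "line %d: expected 6 columns, found %d\n", NR, NF; bad = 1; next }
        $3 > $4 { printf "line %d: start is after end\n", NR; bad = 1 }
        END     { exit bad }
    ' "$1"
}

# Two made-up segments for one sample, tab-separated.
printf 'S1\t1\t10000\t500000\t120\t0.02\n'   >  demo.seg
printf 'S1\t1\t500001\t900000\t85\t-0.75\n'  >> demo.seg

check_seg demo.seg && echo "demo.seg looks well-formed"
```

The check only covers the table shape. It does not verify that chromosome names or coordinates match the reference gene file passed with -refgene.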

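If you also want to join our knowledge-sharing team

What are you waiting for? Take action now! Send an email (jmzeng1314@163.com) to jimmy, the founder of 生信技能树, and there will be a surprise for you! Of course, no junk or harassing emails please. Bring your résumé and a sincere wish to learn and exchange ideas!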
