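Raw sequencing data is usually uploaded to NCBI's SRA database, and it is then mirrored at EBI as a matter of course. You need to be familiar with the GEO and SRA databases.
Generally, downloading sra files with the prefetch command that NCBI provides is too slow. See the post "Downloading fastq sequencing data directly from the EBI database" instead: configure the Aspera client (ascp) yourself, then look up your dataset on EBI to get the fq.txt file of paths.
The script is as follows: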
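# conda activate download
# Set up the "download" conda environment yourself beforehand.
# fq.txt holds one EBI path per line.
cat fq.txt | while read -r id
do
  ascp -QT -l 300m -P33001 \
    -i ~/miniconda3/envs/download/etc/asperaweb_id_dsa.openssh \
    "era-fasp@$id" .
done
# To run it in the background, with the script saved as step1-aspera.sh:
# nohup bash step1-aspera.sh 1>step1-aspera.log 2>&1 &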
This script batch-downloads the fastq sequencing files listed in fq.txt, the file of paths you looked up on EBI.
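The fq.txt list does not have to be assembled by hand. Below is a minimal sketch, assuming you have exported a tab-separated file report from EBI/ENA that has a header row and a column named fastq_aspera, whose value is a semicolon-separated list of paths. The function name tsv_to_fq_list and the accession PRJNAXXXXXX are hypothetical placeholders, and the URL in the usage comment is my assumption about the ENA Portal API, so check it against the current ENA documentation.

```shell
#!/usr/bin/env bash
# Sketch: convert an ENA file report (TSV on stdin) into fq.txt,
# one Aspera path per line, as expected by the ascp loop.
# Assumption: the header row contains a column named "fastq_aspera".

tsv_to_fq_list() {
  awk -F'\t' '
    NR == 1 {
      # Locate the fastq_aspera column by name in the header row.
      for (i = 1; i <= NF; i++) if ($i == "fastq_aspera") col = i
      next
    }
    col && $col != "" {
      # Paired-end runs list several files separated by ";".
      n = split($col, paths, ";")
      for (j = 1; j <= n; j++) if (paths[j] != "") print paths[j]
    }
  '
}

# Example usage (PRJNAXXXXXX is a placeholder accession):
# curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNAXXXXXX&result=read_run&fields=run_accession,fastq_aspera&format=tsv" \
#   | tsv_to_fq_list > fq.txt
```

Rows with an empty fastq_aspera field are skipped, so runs without fastq files at EBI simply do not appear in fq.txt.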
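Databases that require authorization to access
Sometimes, however, people choose not to make their data fully open, for example by uploading it to: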
- https://www.ncbi.nlm.nih.gov/gap/
- EBI's controlled-access databases
Take the paper in Cancer Cell (2021 May 10) titled "Progressive immune dysfunction with advancing disease stage in renal cell carcinoma". I looked up their data at: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002252.v1.p1
It covers 3 sequencing technologies:
- Single-cell RNA sequencing (scRNA-seq)
- TCR sequencing (scTCR-seq)
- Whole exome sequencing
13 patients:
- 4 with early stage disease (stage I/II),
- 4 with locally advanced disease (stage III),
- 5 with advanced/metastatic disease (stage IV)
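A closer look at the page shows that, although the data sits in dbGaP, is not public, and has to be applied for, two data access requests have already been approved.
The first: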
Requestor: Turajlic, Samra
Affiliation: FRANCIS CRICK INSTITUTE, LTD
Project: Meta-analysis of single-cell sequencing data in clear cell renal cell carcinoma
Date of approval: 2021-05-27
Request status: approved
The second:
Requestor: Van Allen, Eliezer
Affiliation: DANA-FARBER CANCER INST
Project: Whole exome and transcriptome predictors of response to immune checkpoint therapy for advanced cancers
Date of approval: 2021-05-11
Request status: approved
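In the vast majority of cases there is no need to apply for the raw data
There are already a dozen or more single-cell papers on ccRCC, and the other datasets have all released their raw sequencing data, so there is no need to fixate on this one!
Besides, this paper does provide its expression matrix. It is not in the GEO database but attached directly to the paper as supplementary files: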
- supplementary Data S1: Data S1. ScRNA-seq raw count matrix (part 1 of 2), after quality control filtering, with genes as rows and cell barcodes as columns, related to Figure 1–6, S1–3, and S5.
NIHMS1692222-supplement-supplementary_Data_S1.zip (143M)
GUID: 217E8B40-EB49-4FF5-AEF5-57BBBA4DAE61
- supplementary Data S2: Data S2. ScRNA-seq raw count matrix (part 2 of 2), after quality control filtering, with genes as rows and cell barcodes as columns, related to Figure 1–6, S1–3, and S5.
NIHMS1692222-supplement-supplementary_Data_S2.csv (1.7G)
GUID: 34477B69-0F73-4D9A-B926-66981E1D5D4A
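Before loading a 1.7G matrix into R or Python, it can help to check its dimensions from the command line. Below is a minimal sketch, assuming the file is a plain comma-separated table with one header row of cell barcodes, one gene per subsequent row, the gene name in the first column, and no quoted fields that contain commas. None of this is confirmed by the article beyond "genes as rows and cell barcodes as columns", and the function name matrix_dims is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report the gene and cell counts of a count-matrix CSV.
# Assumptions: comma-delimited, header row = cell barcodes,
# first column = gene names, no embedded commas inside fields.

matrix_dims() {
  local file=$1
  awk -F',' '
    NR == 1 { cells = NF - 1; next }   # header: drop the gene-name column
    NF > 0  { genes++ }                # every non-empty data row is one gene
    END     { printf "genes=%d cells=%d\n", genes, cells }
  ' "$file"
}

# Example usage with the supplementary file named in the article:
# matrix_dims NIHMS1692222-supplement-supplementary_Data_S2.csv
```

Since Data S1 and Data S2 are described as part 1 and part 2 of the same matrix, running this on both parts (after unzipping S1) is a quick way to see how they should be combined before analysis.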