Reprocessing single-cell transcriptome data from 3,500 TNBC cells

Paper: A Targetable EGFR-Dependent Tumor-Initiating Program in Breast Cancer. Because bulk sequencing could not resolve the question, the authors turned to single-cell RNA-seq:

To understand functional properties associated with heterogeneous EGFR expression in an unbiased manner, single cell RNA-seq was performed on freshly dissociated cells from the PDX (3,483 cells, with an average of 40,564 unique molecular identifiers (UMIs) and 5,146 genes detected per cell)

The raw data are all in the SRA database: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP110989

Run	Library name	MBases	MBytes	Experiment	Instrument
SRR5799776	PDX1735_run4105_lane006	33,751	18,184	SRX2979241	Illumina HiSeq 4000
SRR5799775	PDX1735_run4143_lane001	19,420	9,534	SRX2979242	Illumina HiSeq 2500
SRR5799774	PDX1735_run4143_lane002	19,408	9,548	SRX2979243	Illumina HiSeq 2500

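However, the authors did not provide an expression matrix, so the only option is to download the raw data and run the full single-cell RNA-seq pipeline ourselves.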

# Download the three runs in the background (wget -c resumes interrupted transfers).
mkdir -p ~/data/public/TNBC/
cd ~/data/public/TNBC/
nohup wget -c ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR579/SRR5799774/SRR5799774.sra &
nohup wget -c ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR579/SRR5799776/SRR5799776.sra &
nohup wget -c ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR579/SRR5799775/SRR5799775.sra &

# Once all three downloads have finished, convert each .sra file to gzipped FASTQ.
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 SRR5799774.sra &
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 SRR5799775.sra &
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 SRR5799776.sra &

After downloading and converting to FASTQ, the files are:

1.7G Jan 22 23:34 SRR5799774_1.fastq.gz
 13G Jan 22 23:34 SRR5799774_2.fastq.gz
9.4G Jan 22 17:40 SRR5799774.sra
1.7G Jan 22 23:33 SRR5799775_1.fastq.gz
 13G Jan 22 23:33 SRR5799775_2.fastq.gz
9.4G Jan 22 17:31 SRR5799775.sra
2.9G Jan 23 00:55 SRR5799776_1.fastq.gz
 24G Jan 23 00:55 SRR5799776_2.fastq.gz
 18G Jan 22 18:25 SRR5799776.sra

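The two read files of each run differ greatly in size, because this is not ordinary paired-end sequencing.

The sequencing layout has to be looked up in the paper itself; the supplementary material describes it: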

26 bp Read1, 8 bp I7 Index, 0 bp I5 Index and 98 bp Read2.
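
A quick way to compare the dumped files against this layout is to tabulate the read lengths in each FASTQ. The following is only a minimal sketch: the function name `read_length_hist` and the default of 100,000 reads are my own choices, not anything from the paper or from 10x Genomics.

```shell
#!/usr/bin/env bash
# Print "length<TAB>count" for the first N reads of a gzipped FASTQ file.
# Usage: read_length_hist FILE.fastq.gz [N]    (N defaults to 100000; hypothetical helper)
# Example: read_length_hist SRR5799774_1.fastq.gz
read_length_hist() {
    local fastq="$1" n="${2:-100000}"
    gzip -dc "$fastq" \
        | awk -v n="$n" '
            NR % 4 == 2 { count[length($0)]++; seen++ }   # sequence lines only
            seen >= n   { exit }                          # stop early; END still runs
            END { for (len in count) printf "%d\t%d\n", len, count[len] }' \
        | sort -n
}
```

With the layout above one would expect to see lengths of 26, 8 and 98 bp somewhere among the files of a run; a file that contains only 8 bp reads can only be the index read.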

The sequencing depth is: a total of 717,982,475 reads, and 179,137 reads per single-cell

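Because this is single-cell RNA-seq data generated with the 10x Genomics platform, it has to be processed with their own tool, Cell Ranger, which can be downloaded and installed after a simple registration. I downloaded a test dataset and found: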

├── [237M] neurons_900_S1_L001_I1_001.fastq.gz
├── [642M] neurons_900_S1_L001_R1_001.fastq.gz
├── [1.8G] neurons_900_S1_L001_R2_001.fastq.gz
├── [238M] neurons_900_S1_L002_I1_001.fastq.gz
├── [646M] neurons_900_S1_L002_R1_001.fastq.gz
└── [1.8G] neurons_900_S1_L002_R2_001.fastq.gz
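
The file names in this listing follow the bcl2fastq-style pattern `SAMPLE_S1_L00<lane>_<read>_001.fastq.gz`, with `I1`, `R1` and `R2` as the read types. As a small illustration, the sketch below builds such a name; `cellranger_fastq_name` is a hypothetical helper of mine, and the fixed `S1` simply mirrors the test data.

```shell
#!/usr/bin/env bash
# Build a FASTQ name of the form SAMPLE_S1_L00<lane>_<read>_001.fastq.gz.
# Usage: cellranger_fastq_name SAMPLE LANE READ   (hypothetical helper)
# LANE is a plain integer without leading zeros; READ is I1, R1 or R2.
cellranger_fastq_name() {
    local sample="$1" lane="$2" read="$3"
    case "$read" in
        I1|R1|R2) ;;
        *) echo "read type must be I1, R1 or R2" >&2; return 1 ;;
    esac
    # "S1" is hard-coded here as an assumption, matching the listing above.
    printf '%s_S1_L%03d_%s_001.fastq.gz\n' "$sample" "$lane" "$read"
}
```

For example, `cellranger_fastq_name neurons_900 2 R2` reproduces the last file name in the listing.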

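Here too the read files differ in size, and each lane has three files, because the layout is 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode. Only the read 2 FASTQ contains actual transcript sequence; the other two files hold only barcodes (the cell barcode plus UMI in R1, the sample index in I1). Cell Ranger can analyze this data directly, with the following command: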

/home/jianmingzeng/biosoft/10xgenomic/cellranger-2.1.0/cellranger count --id=neurons \
--localcores 5 \
--transcriptome=/home/jianmingzeng/biosoft/10xgenomic/db/refdata-cellranger-mm10-1.2.0 \
--fastqs=/home/jianmingzeng/data/public/10x/neurons_900_fastqs \
--sample=neurons \
--expect-cells=900
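
To make the Read 1 layout described above concrete (16 bp cell barcode followed by 10 bp UMI), here is a minimal sketch that splits each Read 1 sequence into its two parts. It is only an illustration and not how Cell Ranger is implemented; `split_barcode_umi` is a hypothetical name, and the default lengths are taken from the layout quoted in this post.

```shell
#!/usr/bin/env bash
# Read FASTQ records on stdin and print "barcode<TAB>UMI" for each read.
# Usage: gzip -dc R1.fastq.gz | split_barcode_umi [BARCODE_LEN] [UMI_LEN]
# Defaults (16 and 10) follow the Read 1 layout quoted in the text.
split_barcode_umi() {
    local bc_len="${1:-16}" umi_len="${2:-10}"
    awk -v b="$bc_len" -v u="$umi_len" '
        NR % 4 == 2 {
            # Reads too short to hold barcode + UMI are counted and skipped.
            if (length($0) < b + u) { short++; next }
            print substr($0, 1, b) "\t" substr($0, b + 1, u)
        }
        END {
            if (short)
                printf("%d reads shorter than %d bp skipped\n", short, b + u) > "/dev/stderr"
        }'
}
```

An 8 bp index read fed through this function produces no output at all, which is exactly why the cell of origin cannot be recovered without the real Read 1.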

However, the data the authors uploaded are missing key information, so I wrote to 10x Genomics to ask about it:

I just read a paper: A Targetable EGFR-Dependent Tumor-Initiating Program in Breast Cancer
and they choose 10x genomics for scRNA-seq, and upload the raw data into SRA database.

While I’ve download them, there should be 26 bp Read1, 8 bp I7 Index, 0 bp I5 Index and 98 bp Read2.

But I just found the 8 bp in fq1, and 98bp in fq2, the key information just lost , which means I can’t use the Cell Ranger to process them.

Any help ?

The company replied that without the barcode information, the data cannot be processed.

Michael Campbell (10x Genomics), Jan 26, 07:03 PST

Hi Jianming,

That’s right if you don’t have the 26bp read with the 10x barcode and UMI in it you can’t use Cell Ranger, or any other tool for that matter because there is no way to related the second read to the cell it came from. I would contact the corresponding author to see what happened to the R1 read. If you want, you can send me the SRR number and I can have a look to see if the R1 read is buried somewhere.

Best,
Mike

I then sent him the paper and the SRA accession numbers. He checked the data again, and it is indeed the authors' mistake.

Hi Jainming,

It looks like they uploaded the index read as read 1 instead of the read with the barcode. It’s not analyzable in this format.

Best,
Mike
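
The diagnosis in this reply can be checked mechanically before spending any time on alignment. The sketch below is an assumption-laden helper, not an official tool: it looks only at the first read of each gzipped FASTQ and treats a 26 bp read as the barcode + UMI read, per the layout quoted from the paper. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Length of the first read in a gzipped FASTQ file.
first_read_length() {
    gzip -dc "$1" | awk 'NR == 2 { print length($0); exit }'
}

# Succeed only if at least one of the given files looks like the 26 bp
# barcode + UMI read; 26 is an assumption taken from the paper's layout.
# Usage: has_barcode_read FILE1.fastq.gz [FILE2.fastq.gz ...]
has_barcode_read() {
    local f
    for f in "$@"; do
        [ "$(first_read_length "$f")" = "26" ] && return 0
    done
    return 1
}
```

For a run whose files contain only 8 bp and 98 bp reads, as described in the emails above, `has_barcode_read` returns a non-zero status, signalling that Cell Ranger cannot be used on it.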

生信菜鸟团
