Integration with single cell and TCGA bulk data

If you are using an external R installation, point R_HOME to it before importing the packages (this may not be necessary on Linux systems).

import os
os.environ['R_HOME'] = r'C:\Program Files\R\R-4.4.1'
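The hard-coded path above only works on the machine it was written for. A minimal sketch of a more defensive variant follows. The helper name `set_r_home` is hypothetical and not part of CytoBulk; it sets R_HOME only when the variable is not already defined and the candidate directory exists.

```python
import os
from pathlib import Path


def set_r_home(candidate):
    """Set R_HOME to `candidate` if R_HOME is unset and `candidate` is a directory.

    Hypothetical helper for illustration; returns the effective R_HOME or None.
    """
    existing = os.environ.get("R_HOME")
    if existing:
        # Respect a value that is already configured in the environment.
        return existing
    if candidate and Path(candidate).is_dir():
        os.environ["R_HOME"] = str(candidate)
        return os.environ["R_HOME"]
    # Nothing was set; R will be located through the default discovery, if at all.
    return None


r_home = set_r_home(r"C:\Program Files\R\R-4.4.1")
print("R_HOME:", r_home if r_home else "not set")
```

On a Linux system where R is already discoverable, the function leaves the environment untouched.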

Import the packages

import os
import cytobulk as ct
import scanpy as sc
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Load scRNA-seq and bulk data

Load the reference single-cell data (e.g. HTAN MSK data [Download data]) and the TCGA LUSC bulk data [Download data].

sc_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/TCGA_LUSC/input/sub_HTAN_MSK.h5ad")
sc_adata_ori = sc_adata.copy()
bulk_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/TCGA_LUSC/input/TCGA_LUSC.h5ad")

The cell type information is stored in sc_adata.obs['he_cell_type'].

sc_adata.obs['he_cell_type'].value_counts()

he_cell_type
epithelial           25988
lymphocytes           4535
connective tissue      565
neutrophils             69
plasma cells             8
Name: count, dtype: int64
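The counts above are strongly imbalanced, so it can help to look at fractions and flag sparsely represented cell types before deconvolution. The sketch below rebuilds the displayed counts as a pandas Series; in the tutorial the same Series would come from sc_adata.obs['he_cell_type'].value_counts(). The `min_cells` threshold is a hypothetical value chosen for illustration, not a CytoBulk requirement.

```python
import pandas as pd

# Counts copied from the output above.
counts = pd.Series(
    {
        "epithelial": 25988,
        "lymphocytes": 4535,
        "connective tissue": 565,
        "neutrophils": 69,
        "plasma cells": 8,
    },
    name="count",
)

# Fraction of all reference cells contributed by each cell type.
fractions = counts / counts.sum()

# Hypothetical threshold: cell types with fewer cells than this are flagged.
min_cells = 50
rare_types = counts[counts < min_cells].index.tolist()

print(fractions.round(4))
print("Cell types below threshold:", rare_types)
```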

Deconvolve bulk data with scRNA-seq as the reference

If you want to use the pretrained model, download the folder [Download], extract it, and pass the path of the extracted folder as the out_dir parameter. This lets you skip the training steps.

deconv_result,deconv_adata = ct.tl.bulk_deconv(bulk_data = bulk_adata,
                                                sc_adata = sc_adata,
                                                annotation_key ="he_cell_type",
                                                out_dir=r"C:\Users\wangxueying\project\CytoBulk\case\TCGA_LUSC\TCGA_LUSC_2000",
                                                dataset_name="lusc",
                                                different_source=True,
                                                downsampling=True,
                                                n_cell=2000)

deconv_result.head(5)

	connective tissue	epithelial	lymphocytes	neutrophils	plasma cells
TCGA-18-3421	0.273443	0.176402	0.199808	0.204587	0.130384
TCGA-37-4133	0.282663	0.193125	0.156352	0.146849	0.171124
TCGA-L3-A524	0.337083	0.295797	0.119766	0.124372	0.121685
TCGA-56-A4ZK	0.322920	0.144941	0.182277	0.145558	0.146625
TCGA-39-5027	0.241746	0.491989	0.093591	0.133225	0.109625
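A quick way to read this table is to pull out the dominant cell type per sample. The sketch below rebuilds the five displayed rows as a DataFrame (in the tutorial, deconv_result would be used directly) and also reports each row sum as-is, without assuming the predicted fractions are normalized.

```python
import pandas as pd

# Values copied from the deconv_result.head(5) output above.
deconv_example = pd.DataFrame(
    {
        "connective tissue": [0.273443, 0.282663, 0.337083, 0.322920, 0.241746],
        "epithelial": [0.176402, 0.193125, 0.295797, 0.144941, 0.491989],
        "lymphocytes": [0.199808, 0.156352, 0.119766, 0.182277, 0.093591],
        "neutrophils": [0.204587, 0.146849, 0.124372, 0.145558, 0.133225],
        "plasma cells": [0.130384, 0.171124, 0.121685, 0.146625, 0.109625],
    },
    index=["TCGA-18-3421", "TCGA-37-4133", "TCGA-L3-A524", "TCGA-56-A4ZK", "TCGA-39-5027"],
)

summary = pd.DataFrame(
    {
        # Column name with the largest predicted value in each row.
        "dominant_cell_type": deconv_example.idxmax(axis=1),
        "dominant_fraction": deconv_example.max(axis=1),
        # Reported for inspection only; no normalization is assumed.
        "row_sum": deconv_example.sum(axis=1),
    }
)
print(summary)
```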

Mapping scRNA-seq to bulk data

If you want to use multithreading for mapping, you can set multiprocessing=True and specify the number of CPUs to use with the cpu_num parameter.

reconstructed_cell, reconstructed_adata = ct.tl.bulk_mapping(bulk_adata = deconv_adata,
                                                            sc_adata = sc_adata_ori,
                                                            out_dir="/data1/wangxueying/cytobulk/out/TCGA_LUSC_2000",
                                                            project="TCGA_LUSC",
                                                            n_cell=2000,
                                                            annotation_key='he_cell_type',
                                                            multiprocessing=False)
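To switch the call above to multiprocessing, the multiprocessing and cpu_num parameters mentioned earlier need values. The sketch below shows one way to pick them from the machine's core count. The helper `mapping_parallel_kwargs` and its `reserve` argument are hypothetical, and passing cpu_num alongside multiprocessing=False is assumed to be harmless.

```python
import os


def mapping_parallel_kwargs(reserve=1):
    """Build keyword arguments for parallel mapping, leaving `reserve` cores free.

    Hypothetical helper for illustration; not part of CytoBulk.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    cpu_num = max(1, available - reserve)
    # Only enable multiprocessing when more than one worker is available.
    return {"multiprocessing": cpu_num > 1, "cpu_num": cpu_num}


parallel_kwargs = mapping_parallel_kwargs()
print(parallel_kwargs)

# Possible usage (not executed here): replace multiprocessing=False in the
# ct.tl.bulk_mapping call with **parallel_kwargs.
```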

The matching relationship between single cells and bulk samples is stored in reconstructed_cell. The data assigned to the same bulk sample is aggregated, and the new expression values are stored in reconstructed_adata.layers['mapping_ori'], while the original expression values are stored in reconstructed_adata.X.

reconstructed_cell.head(5)

	sample_id	cell_id
0	TCGA-18-3421	RU426B_197057083469109
1	TCGA-18-3421	RU426B_135693665261491
2	TCGA-18-3421	RU1144_T_236768069871899
3	TCGA-18-3421	RU1311A_T_1_130539716504350
4	TCGA-18-3421	RU1195A_236107999337252
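The mapping table can be summarized with ordinary pandas operations. The sketch below rebuilds the five displayed rows and counts how many cells were assigned to each bulk sample. It also derives a `source_id` column by stripping the trailing numeric barcode from cell_id; that the prefix identifies the originating single-cell sample is an assumption based on the IDs shown, not something the tutorial states.

```python
import pandas as pd

# Rows copied from the reconstructed_cell.head(5) output above.
mapping_example = pd.DataFrame(
    {
        "sample_id": ["TCGA-18-3421"] * 5,
        "cell_id": [
            "RU426B_197057083469109",
            "RU426B_135693665261491",
            "RU1144_T_236768069871899",
            "RU1311A_T_1_130539716504350",
            "RU1195A_236107999337252",
        ],
    }
)

# Number of mapped cells (and distinct cells) per bulk sample.
cells_per_sample = mapping_example.groupby("sample_id")["cell_id"].agg(["size", "nunique"])

# Assumption: everything before the last underscore names the source sample.
mapping_example["source_id"] = mapping_example["cell_id"].str.rsplit("_", n=1).str[0]
source_counts = mapping_example.groupby("sample_id")["source_id"].value_counts()

print(cells_per_sample)
print(source_counts)
```

A `size` larger than `nunique` would indicate that the same single cell was assigned to a bulk sample more than once.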

Visualization of marker gene expression similarity between original and reconstructed data

Load the marker gene data across cell types. [Download st data]

marker_df = pd.read_csv(r"C:\Users\wangxueying\project\CytoBulk\case\bulk_brca\marker_gene_symbol.txt", sep="\t",index_col=0)

marker_df.head(5)

	score	pvalue	adj_pvalue	cell_type	ensg_id	gene_symbol
gene
ENSG00000142089	13.354671	5.980232e-32	1.183209e-30	connective tissue	ENSG00000142089	IFITM3
ENSG00000142089	-8.668920	5.285041e-18	1.676378e-17	epithelial	ENSG00000142089	IFITM3
ENSG00000142089	-3.913898	9.135025e-05	1.656657e-04	lymphocytes	ENSG00000142089	IFITM3
ENSG00000142089	-2.615495	1.768798e-02	2.432355e-02	neutrophils	ENSG00000142089	IFITM3
ENSG00000142089	-0.253202	8.101615e-01	8.899934e-01	plasma cells	ENSG00000142089	IFITM3
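Each gene appears once per cell type in this table, so selecting markers for a given cell type means filtering rows. The sketch below rebuilds the five displayed rows and keeps those with a positive score and a small adjusted p-value. Treating a positive score as enrichment and using a cutoff of 0.05 are assumptions made for illustration; the tutorial does not specify how ct.pl.gene_similarity selects genes.

```python
import pandas as pd

# Rows copied from the marker_df.head(5) output above.
marker_example = pd.DataFrame(
    {
        "score": [13.354671, -8.668920, -3.913898, -2.615495, -0.253202],
        "pvalue": [5.980232e-32, 5.285041e-18, 9.135025e-05, 1.768798e-02, 8.101615e-01],
        "adj_pvalue": [1.183209e-30, 1.676378e-17, 1.656657e-04, 2.432355e-02, 8.899934e-01],
        "cell_type": ["connective tissue", "epithelial", "lymphocytes", "neutrophils", "plasma cells"],
        "ensg_id": ["ENSG00000142089"] * 5,
        "gene_symbol": ["IFITM3"] * 5,
    },
    index=pd.Index(["ENSG00000142089"] * 5, name="gene"),
)

alpha = 0.05  # hypothetical significance cutoff

# Keep rows where the gene scores positively and passes the cutoff.
selected = marker_example[
    (marker_example["score"] > 0) & (marker_example["adj_pvalue"] < alpha)
]
print(selected[["cell_type", "gene_symbol", "score", "adj_pvalue"]])
```

Under these assumptions, IFITM3 is retained only for connective tissue.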

ct.pl.gene_similarity(reconstructed_adata, marker_df)

Processing cell_type: connective tissue
Processing cell_type: epithelial
Processing cell_type: lymphocytes
Processing cell_type: neutrophils
Processing cell_type: plasma cells

<module 'matplotlib.pyplot' from 'c:\\Users\\wangxueying\\anaconda\\envs\\cytobulk\\lib\\site-packages\\matplotlib\\pyplot.py'>
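For a numeric counterpart to the plot, one common similarity measure is the per-sample Pearson correlation between original and reconstructed expression over the marker genes. The sketch below uses small synthetic matrices as stand-ins for reconstructed_adata.X and reconstructed_adata.layers['mapping_ori']. It is not necessarily the statistic that ct.pl.gene_similarity computes, and the function name `per_sample_pearson` is hypothetical.

```python
import numpy as np

# Synthetic stand-ins: 4 samples x 6 marker genes.
rng = np.random.default_rng(0)
original = rng.gamma(shape=2.0, scale=1.0, size=(4, 6))
reconstructed = original + rng.normal(loc=0.0, scale=0.1, size=original.shape)


def per_sample_pearson(a, b):
    """Pearson correlation between matching rows of two equally shaped matrices."""
    a_centered = a - a.mean(axis=1, keepdims=True)
    b_centered = b - b.mean(axis=1, keepdims=True)
    numerator = (a_centered * b_centered).sum(axis=1)
    denominator = np.sqrt((a_centered**2).sum(axis=1) * (b_centered**2).sum(axis=1))
    return numerator / denominator


similarity = per_sample_pearson(original, reconstructed)
print(similarity.round(3))
```

Rows with constant expression would produce a zero denominator, so real data may need such samples filtered out first.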
