Integration with single cell and TCGA bulk data
If using an external R installation (may not be necessary on Linux systems).
import os
os.environ['R_HOME'] = r'C:\Program Files\R\R-4.4.1'
import the packages
import os
import cytobulk as ct
import scanpy as sc
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
Load scRNA-seq and bulk data
Load the reference single cell data, e.g. HTAN MSK data [Download data]. and TCGA LUSC data [Download data].
sc_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/TCGA_LUSC/input/sub_HTAN_MSK.h5ad")
sc_adata_ori = sc_adata.copy()
bulk_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/TCGA_LUSC/input/TCGA_LUSC.h5ad")
The cell type information is be stored in sc_adata.obs['he_cell_type']
sc_adata.obs['he_cell_type'].value_counts()
Deconvolute bulk data with sc-RNA seq as the reference.
If you want to use the pretrained model, please download the folder, extract it, and set the path of the extracted folder as the out_dir parameter. [Download] This will help you skip the training steps.
deconv_result,deconv_adata = ct.tl.bulk_deconv(bulk_data = bulk_adata,
sc_adata = sc_adata,
annotation_key ="he_cell_type",
out_dir=r"C:\Users\wangxueying\project\CytoBulk\case\TCGA_LUSC\TCGA_LUSC_2000",
dataset_name="lusc",
different_source=True,
downsampling=True,
n_cell=2000)
deconv_result.head(5)
Mapping scRNA-seq to bulk data
If you want to use multithreading for mapping, you can set multiprocessing=True and specify the number of CPUs to use with the cpu_num parameter.
reconstructed_cell, reconstructed_adata = ct.tl.bulk_mapping(bulk_adata = deconv_adata,
sc_adata = sc_adata_ori,
out_dir="/data1/wangxueying/cytobulk/out/TCGA_LUSC_2000",
project="TCGA_LUSC",
n_cell=2000,
annotation_key='he_cell_type',
multiprocessing=False)
The matching relationship between single cells and bulk samples is stored in reconstructed_cell. The data assigned to the same bulk sample is aggregated, and the new expression values are stored in reconstructed_adata.layers['mapping_ori'], while the original expression values are stored in reconstructed_adata.X.
reconstructed_cell.head(5)
Visulization of marker gene expression similarity between original and reconstructed data
Load the marker gene data across cell types. [Download st data]
marker_df = pd.read_csv(r"C:\Users\wangxueying\project\CytoBulk\case\bulk_brca\marker_gene_symbol.txt", sep="\t",index_col=0)
marker_df.head(5)
ct.pl.gene_similarity(reconstructed_adata, marker_df)