H&E cell prediction and integration with scRNA-seq on TCGA data
If you use an external R installation, set R_HOME before importing the packages (this may not be necessary on Linux systems).
import os
os.environ['R_HOME'] = r'C:\Program Files\R\R-4.4.1'
Import the packages.
import os
import cytobulk as ct
import scanpy as sc
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
Load scRNA-seq data and ligand-receptor data (optional)
If you want to perform cell segmentation on H&E images and integrate the result with single-cell data, complete the following steps. Otherwise, you can skip this section.
Load the reference single cell data, e.g. HTAN MSK data. [Download data]
sc_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/he_image/svs/TCGA_LUSC/sub_HTAN_MSK.h5ad")
Please ensure that the annotation column of sc_adata (here sc_adata.obs['he_cell_type']) contains at least one of the following six cell types: lymphocytes, epithelial cells, plasma cells, neutrophils, eosinophils, and connective tissue.
sc_adata.obs['he_cell_type']
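The requirement above can be checked before running the mapping. The following is a minimal sketch, not part of CytoBulk: the helper name matching_he_types and the example labels are made up for illustration, and the case-insensitive comparison is an assumption about how labels might be written in your data, not a statement about how CytoBulk compares them.

```python
# Cell types the H&E model can predict, as listed in the text above.
HE_CELL_TYPES = {
    "lymphocytes", "epithelial cells", "plasma cells",
    "neutrophils", "eosinophils", "connective tissue",
}

def matching_he_types(labels):
    """Return the supported H&E cell types found in `labels` (case-insensitive)."""
    seen = {str(label).strip().lower() for label in labels}
    return sorted(seen & HE_CELL_TYPES)

# Made-up labels for illustration. With real data you would pass
# sc_adata.obs['he_cell_type'].unique() instead.
example_labels = ["Lymphocytes", "epithelial cells", "Fibroblast"]
found = matching_he_types(example_labels)
if not found:
    raise ValueError("No supported H&E cell type found in the annotation column.")
print(found)
```

If the check passes only after lowercasing, the labels probably still need to be renamed to the exact spelling the mapping step expects.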
Load the reference ligand-receptor data, e.g. the CellChatDB ligand-receptor database. [Download data]
lr_data = pd.read_csv("C:/Users/wangxueying/project/CytoBulk/case/he_image/svs/input/lrpairs.csv")
We use the CellChatDB ligand-receptor database here. You can also use any other ligand-receptor pair data, as long as it follows the format below (the ligand and receptor columns are required).
lr_data.head(5)
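If you bring your own ligand-receptor table, a quick check of the header can catch a missing column early. This is a minimal sketch under the assumption that the required columns are named exactly "ligand" and "receptor" (the text names the columns but not their exact spelling or case); the helper missing_lr_columns and the example header are hypothetical.

```python
# Column names assumed from the description above; adjust if your table differs.
REQUIRED_LR_COLUMNS = ("ligand", "receptor")

def missing_lr_columns(columns):
    """Return the required ligand-receptor columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_LR_COLUMNS if name not in present]

# Made-up header for illustration. With real data you would pass lr_data.columns.
example_columns = ["ligand", "receptor", "pathway"]
missing = missing_lr_columns(example_columns)
if missing:
    raise ValueError(f"Ligand-receptor table is missing columns: {missing}")
print("All required columns are present.")
```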
Preprocess H&E image
This example uses the H&E image of the TCGA-56-8626 sample. The input H&E image format is .svs. First, we crop a region 2240 wide and 2240 high, centered at x = 10000 and y = 11200. This region is then split into a 10 x 10 grid of sub-images, each 224 wide and 224 high. [Download data]
ct.pp.process_svs_image(
    svs_path=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\image\TCGA-37-4132.svs",
    output_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\input\demo_split_test",
    crop_size=224, magnification=1,
    center_x=10000, center_y=11200,
    fold_width=10, fold_height=10,
)
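The arithmetic behind these parameters can be worked through in a few lines. This is a sketch of the geometry described in the text, not of the internals of ct.pp.process_svs_image: it assumes the region is centered exactly on (center_x, center_y), that tile origins are measured from the top-left corner of the cropped region, and that magnification=1 applies no rescaling. The helper names crop_box and tile_origins are made up for illustration.

```python
def crop_box(center_x, center_y, crop_size, fold_width, fold_height):
    """Return (left, top, right, bottom) of the region centered on the given point."""
    width = crop_size * fold_width
    height = crop_size * fold_height
    left = center_x - width // 2
    top = center_y - height // 2
    return left, top, left + width, top + height

def tile_origins(crop_size, fold_width, fold_height):
    """Return the (x, y) origin of every tile, relative to the cropped region."""
    return [
        (col * crop_size, row * crop_size)
        for row in range(fold_height)
        for col in range(fold_width)
    ]

box = crop_box(center_x=10000, center_y=11200, crop_size=224, fold_width=10, fold_height=10)
origins = tile_origins(crop_size=224, fold_width=10, fold_height=10)
print(box)                                   # (8880, 10080, 11120, 12320)
print(len(origins), origins[0], origins[-1]) # 100 (0, 0) (2016, 2016)
```

With these values the region spans 2240 in each direction and yields 100 tiles, which matches the 2240 and 224 sizes quoted above.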
If your image is in another format, you can split it yourself to generate a folder containing sub-images. Each subfolder should be named after the starting x and y coordinates of the cropped images. Inside each subfolder, the split images should be stored.
- Each subfolder name corresponds to the starting x and y coordinates (e.g., 0_0, 0_224, etc.).
- Inside each subfolder, the cropped images are stored, such as 0.jpg.
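If you split the image yourself, the folder layout described above can be checked before prediction. The following is a minimal sketch using only the standard library. The helper check_tile_layout is hypothetical, and the accepted image suffixes are an assumption (the text only shows 0.jpg as an example).

```python
import re
import tempfile
from pathlib import Path

FOLDER_PATTERN = re.compile(r"^\d+_\d+$")   # e.g. 0_0, 0_224
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: only .jpg appears in the text

def check_tile_layout(image_dir):
    """Return a list of human-readable problems found in the tile folder layout."""
    root = Path(image_dir)
    problems = []
    subfolders = sorted(p for p in root.iterdir() if p.is_dir())
    if not subfolders:
        problems.append(f"{root.name}: no subfolders found")
    for sub in subfolders:
        if not FOLDER_PATTERN.match(sub.name):
            problems.append(f"{sub.name}: folder name is not of the form x_y")
        has_image = any(
            f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES for f in sub.iterdir()
        )
        if not has_image:
            problems.append(f"{sub.name}: no image file inside")
    return problems

# Self-contained demonstration on a temporary folder with one valid
# subfolder and one empty subfolder.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "0_0"
    good.mkdir()
    (good / "0.jpg").write_bytes(b"")   # empty placeholder, not a real image
    (Path(tmp) / "0_224").mkdir()       # empty subfolder
    for problem in check_tile_layout(tmp):
        print(problem)
```

The check only looks at names and file suffixes; it does not open the images or verify their size.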
Predict cell type labels from H&E image
If you only want to perform cell segmentation and cell type prediction on the H&E image, without single-cell mapping, call the function directly with mapping_sc=False.
cell_coordinates = ct.tl.he_mapping(
    image_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\input\demo_split",
    out_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo",
    project="demo",
    lr_data=None,
    sc_adata=None,
    annotation_key="he_cell_type",
    k_neighbor=30,
    alpha=0.5,
    mapping_sc=False)
If ct.tl.he_mapping raises the error _pickle.UnpicklingError: invalid load key, '<'., the pretrained model file was not fully downloaded, leaving a corrupted or incomplete file. To resolve this issue, download the model file manually and place it in the correct location. [Download model]
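The load key '<' in the message means the file starts with the character <, which is typical of an HTML page saved in place of the binary model. A quick look at the first bytes confirms this before you download the file again. This is a minimal sketch: the helper looks_like_html and the file name model.pt are hypothetical, and the actual model location depends on your CytoBulk installation.

```python
import tempfile
from pathlib import Path

def looks_like_html(path, probe_bytes=64):
    """Return True if the file starts with '<', as an HTML or XML page would."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    return head.lstrip().startswith(b"<")

# Self-contained demonstration with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    broken = Path(tmp) / "model.pt"          # hypothetical file name
    broken.write_bytes(b"<!DOCTYPE html><html><body>Error</body></html>")
    binary = Path(tmp) / "model_ok.pt"       # hypothetical file name
    binary.write_bytes(b"\x80\x02binary-payload")
    print(looks_like_html(broken))   # True
    print(looks_like_html(binary))   # False
```

A True result only indicates that the file is not a binary model. A False result does not prove the download is complete.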
The predicted cell types and their corresponding coordinates are stored in the following format. They are also written to the file cell_coordinates.txt in the out_dir directory.
cell_coordinates.head(5)
Next, use the built-in function to visualize the results.
ct.pl.he_cell_type(cell_coordinates,out_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo")
Map scRNA-seq data onto H&E image
If you want to perform cell segmentation and cell type prediction on the H&E image and also map single cells onto it, use the function as follows.
The predicted cell types from the H&E image include: Eosinophils, Plasma Cells, Connective Tissue, Epithelial Cells, Neutrophils, and Lymphocytes. First, ensure that your dataset's cell type labels include at least one of these types. You can use the following code as a reference to map your existing cell type labels to the required format:
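# Keys are the labels used in your data; values are the labels expected for H&E cell types.
mapping = {
    'plasma cells': 'Plasma Cells',
    'connective tissue': 'Connective Tissue',
    'epithelial': 'Epithelial Cells',
    'neutrophils': 'Neutrophils',
    'lymphocytes': 'Lymphocytes'}
# Labels that are not keys in the mapping are left unchanged.
sc_adata.obs['he_cell_type'] = sc_adata.obs['he_cell_type'].replace(mapping)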
cell_coordinates, df, filtered_adata = ct.tl.he_mapping(
    image_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\input\demo_split",
    out_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo",
    project="demo",
    lr_data=lr_data,
    sc_adata=sc_adata,
    annotation_key="he_cell_type",
    k_neighbor=30,
    alpha=0.5,
    mapping_sc=True)
Here, df and filtered_adata.obs store the single-cell IDs and their assigned coordinates.
filtered_adata.obs.head(5)