H&E cell prediction and integration with scRNA-seq on TCGA data

If you are using an external R installation, set R_HOME before importing the packages (this may not be necessary on Linux systems).

import os
os.environ['R_HOME'] = r'C:\Program Files\R\R-4.4.1'
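A slightly more defensive variant sets R_HOME only when it is not already defined and the target directory exists. This is a minimal sketch: the helper name set_r_home is hypothetical (it is not part of CytoBulk), and the path is the one from the example above.

```python
import os

def set_r_home(r_home, environ=os.environ):
    """Set R_HOME to r_home if it is unset and the directory exists.

    Returns the value of R_HOME afterwards, or None if it stays unset.
    """
    if "R_HOME" not in environ and os.path.isdir(r_home):
        environ["R_HOME"] = r_home
    return environ.get("R_HOME")

# Path taken from the example above; adjust it to your own R installation.
set_r_home(r"C:\Program Files\R\R-4.4.1")
```

An existing R_HOME is left untouched, so the same notebook can run on machines where R is already configured.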

Import the packages.

import os
import cytobulk as ct
import scanpy as sc
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Load scRNA-seq data and ligand-receptor data (optional)

If you want to perform cell segmentation on H&E images and integrate the results with single-cell data, complete the following steps. Otherwise, you can skip this section.

Load the reference single-cell data, e.g. the HTAN MSK data. [Download data]

sc_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/he_image/svs/TCGA_LUSC/sub_HTAN_MSK.h5ad")

Please ensure that sc_adata contains at least one of the following six cell types: lymphocytes, epithelial cells, plasma cells, neutrophils, eosinophils, and connective tissue.

sc_adata.obs['he_cell_type']

Cell
RU1311A_T_1_165945547864806                 lymphocytes
RU1181B_169649541863334                      epithelial
RU1108a_RPMI_164761076713396                 epithelial
RU1145_133982151621558                      lymphocytes
RU1145_170180373265117                       epithelial
                                               ...     
RU1145_161890937236718                      lymphocytes
RU1108a_RPMI_160785132370275                 epithelial
RU1108a_Bambanker_Frozen_231897696155998     epithelial
RU1181B_236168014327141                      epithelial
RU1145_120772933872502                       epithelial
Name: he_cell_type, Length: 31165, dtype: category
Categories (5, object): ['connective tissue', 'epithelial', 'lymphocytes', 'neutrophils', 'plasma cells']

Load the reference ligand-receptor data, e.g. the CellChatDB ligand-receptor database. [Download data]

lr_data = pd.read_csv("C:/Users/wangxueying/project/CytoBulk/case/he_image/svs/input/lrpairs.csv")

We use the CellChatDB ligand-receptor database here. You can also use any other ligand-receptor pair data, as long as it follows the format below (the ligand and receptor columns are required).

lr_data.head(5)

	ligand	receptor	species
0	SEMA3F	PLXNA3	Human
1	SEMA3F	PLXNA1	Human
2	SEMA3F	NRP1	Human
3	SEMA3F	NRP2	Human
4	CX3CL1	CX3CR1	Human
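Since only the ligand and receptor columns are required, it can help to check for them before starting a long run. The following is a small sketch with a hypothetical helper, missing_lr_columns, which is not a CytoBulk function; the in-memory CSV reuses rows from the table above.

```python
import csv
import io

REQUIRED_COLUMNS = {"ligand", "receptor"}

def missing_lr_columns(columns):
    """Return the required ligand-receptor columns absent from `columns`."""
    return sorted(REQUIRED_COLUMNS - set(columns))

# Small in-memory example with the same layout as the table above.
example_csv = "ligand,receptor,species\nSEMA3F,PLXNA3,Human\nCX3CL1,CX3CR1,Human\n"
header = next(csv.reader(io.StringIO(example_csv)))
print(missing_lr_columns(header))  # prints []
```

With the DataFrame loaded above, the equivalent call would be missing_lr_columns(lr_data.columns).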

Preprocess H&E image

This example uses the H&E image of the TCGA-56-8626 sample. The input H&E image must be in .svs format. First, a region 2240 wide and 2240 high is cropped, centered at x = 10000 and y = 11200; this size corresponds to fold_width = 10 and fold_height = 10 tiles of crop_size = 224. The region is then split into sub-images, each 224 wide and 224 high. [Download data]

ct.pp.process_svs_image(
    svs_path = r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\image\TCGA-37-4132.svs",
    output_dir = r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\input\demo_split_test",
    crop_size=224, magnification=1,
    center_x=10000, center_y=11200,
    fold_width=10, fold_height=10)

Original image size: 26001x21271
Image center: (10000, 11200)
Crop region: Start=(8880, 10080), Size=(2240, 2240)


Enlarged image size: (2240, 2240)
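The crop region reported in the log can be reproduced with simple arithmetic. The sketch below is an illustration of that arithmetic only, not CytoBulk's implementation: the function names are hypothetical, and it does not model the magnification parameter or clamping at the image border.

```python
def crop_region(center_x, center_y, crop_size, fold_width, fold_height):
    """Return (start_x, start_y, width, height) of a region centred on the given point."""
    width = crop_size * fold_width
    height = crop_size * fold_height
    return center_x - width // 2, center_y - height // 2, width, height

def tile_origins(crop_size, fold_width, fold_height):
    """Return the (x, y) offset of each tile relative to the crop's top-left corner."""
    return [(ix * crop_size, iy * crop_size)
            for ix in range(fold_width)
            for iy in range(fold_height)]

region = crop_region(10000, 11200, crop_size=224, fold_width=10, fold_height=10)
print(region)  # (8880, 10080, 2240, 2240), matching the log above
print(len(tile_origins(224, 10, 10)))  # 100
```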

If your image is in another format, you can split it yourself into a folder of sub-images. Name each subfolder after the starting x and y coordinates of its cropped image (e.g., 0_0, 0_224), and store the cropped image inside it (e.g., 0.jpg).
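The folder layout described above can be planned before any image is cropped. This sketch assumes the x_y naming order and the 0.jpg file name from the examples, and assumes coordinates are relative to the top-left corner of the image being split; plan_tiles is a hypothetical helper, not a CytoBulk function.

```python
from pathlib import Path

def plan_tiles(width, height, out_dir, tile=224):
    """Map each tile's output path to its crop box (left, upper, right, lower).

    Partial tiles at the right and bottom edges are skipped.
    """
    plan = {}
    for x in range(0, width - tile + 1, tile):
        for y in range(0, height - tile + 1, tile):
            plan[Path(out_dir) / f"{x}_{y}" / "0.jpg"] = (x, y, x + tile, y + tile)
    return plan

# A 448 x 448 image yields four tiles: 0_0, 0_224, 224_0 and 224_224.
plan = plan_tiles(448, 448, "demo_split")
for path, box in plan.items():
    print(path.as_posix(), box)
```

Each box uses the (left, upper, right, lower) convention, so it can be passed to an image library such as Pillow (Image.crop) after creating the subfolder with path.parent.mkdir(parents=True, exist_ok=True).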

Predict cell type labels from H&E image

If you only want to perform cell segmentation and cell type prediction within the H&E image without performing single-cell mapping, you can directly use the following function.

cell_coordinates = ct.tl.he_mapping(image_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\input\demo_split",
                                    out_dir = r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo",
                                    project = "demo",
                                    lr_data = None,
                                    sc_adata = None,
                                    annotation_key="he_cell_type",
                                    k_neighbor=30,
                                    alpha=0.5,
                                    mapping_sc=False)

File already exists: c:\Users\wangxueying\anaconda\envs\cytobulk\lib\site-packages\cytobulk\tools\model\pretrained_models\DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth
Model loaded, unexpected keys: []
Generating segmentation and classification maps for sample images
All visual results saved
Combine results
Skipping invalid file name: combinded_cent.txt
Data successfully written to C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo\combinded_cent.txt
Save file done

If ct.tl.he_mapping fails with the error _pickle.UnpicklingError: invalid load key, '<'., the pretrained model file was not fully downloaded and is corrupted or incomplete. To resolve this, download the model file manually and place it in the pretrained_models directory shown in the log above. [Download model]
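The '<' in the error message suggests that the file begins with markup, as would happen if an HTML error page had been saved in place of the weights. That is an inference from the message, not something stated by CytoBulk. Under that assumption, a quick check of the first bytes can confirm the problem before re-downloading; looks_like_html is a hypothetical helper.

```python
import os
import tempfile

def looks_like_html(path):
    """Return True if the file's first non-whitespace byte is '<'."""
    with open(path, "rb") as handle:
        head = handle.read(64).lstrip()
    return head.startswith(b"<")

# Demonstration on two temporary files; replace with the real model path.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "bad.pth")
    good = os.path.join(tmp, "good.pth")
    with open(bad, "wb") as f:
        f.write(b"<!DOCTYPE html><html>error</html>")
    with open(good, "wb") as f:
        f.write(b"PK\x03\x04binary-payload")
    bad_result, good_result = looks_like_html(bad), looks_like_html(good)

print(bad_result, good_result)  # True False
```

If the check returns True for the .pth file, delete it and replace it with the manually downloaded copy.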

The predicted cell types and their corresponding coordinates are stored in the following format. The file is located in the out_dir directory, and its name is cell_coordinates.txt.

cell_coordinates.head(5)

	data_set	x	y	cell_type
0	demo	17	1	Epithelial Cells
1	demo	33	2	Epithelial Cells
2	demo	129	4	Epithelial Cells
3	demo	34	12	Epithelial Cells
4	demo	218	12	Epithelial Cells
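A quick tally of the predicted types is a useful sanity check before plotting. The sketch below uses only the standard library and the five labels shown above; count_cell_types is a hypothetical helper.

```python
from collections import Counter

def count_cell_types(labels):
    """Count how many cells were assigned to each predicted type."""
    return Counter(labels)

# Labels copied from the five rows shown above.
head_labels = ["Epithelial Cells"] * 5
print(count_cell_types(head_labels))  # Counter({'Epithelial Cells': 5})
```

On the full result, count_cell_types(cell_coordinates["cell_type"]) gives the same tally as the pandas call cell_coordinates["cell_type"].value_counts().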

Next, use the built-in function to visualize the results.

ct.pl.he_cell_type(cell_coordinates,out_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo")

Mapping scRNA-seq data on H&E image

If you want to perform cell segmentation and cell type prediction within the H&E image, as well as conduct single-cell mapping, please refer to the following function.

The predicted cell types from the H&E image include: Eosinophils, Plasma Cells, Connective Tissue, Epithelial Cells, Neutrophils, and Lymphocytes. First, ensure that your dataset's cell type labels include at least one of these types. You can use the following code as a reference to map your existing cell type labels to the required format:

mapping = {
    'plasma cells': 'Plasma Cells',
    'connective tissue': 'Connective Tissue',
    'epithelial': 'Epithelial Cells',
    'neutrophils': 'Neutrophils',
    'lymphocytes': 'Lymphocytes'}

sc_adata.obs['he_cell_type'] = sc_adata.obs['he_cell_type'].replace(mapping)
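To see which labels will match the H&E predictions once the mapping is applied, the renamed labels can be compared against the six predicted types. This is a sketch only: check_labels is a hypothetical helper, and the "fibroblast" label is invented to show how an unmatched label is reported.

```python
HE_TYPES = {"Eosinophils", "Plasma Cells", "Connective Tissue",
            "Epithelial Cells", "Neutrophils", "Lymphocytes"}

def check_labels(labels, mapping):
    """Apply `mapping` to the labels and split the results into shared and unmatched sets."""
    renamed = {mapping.get(label, label) for label in labels}
    return renamed & HE_TYPES, renamed - HE_TYPES

mapping = {
    "plasma cells": "Plasma Cells",
    "connective tissue": "Connective Tissue",
    "epithelial": "Epithelial Cells",
    "neutrophils": "Neutrophils",
    "lymphocytes": "Lymphocytes"}

# "fibroblast" is an invented label used only to illustrate an unmatched case.
shared, unmatched = check_labels(["lymphocytes", "epithelial", "fibroblast"], mapping)
print(sorted(shared))     # ['Epithelial Cells', 'Lymphocytes']
print(sorted(unmatched))  # ['fibroblast']
```

The shared set should be non-empty; for the real data, pass sc_adata.obs["he_cell_type"].unique() as the labels.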

cell_coordinates,df,filtered_adata = ct.tl.he_mapping(image_dir=r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\input\demo_split",
                                            out_dir = r"C:\Users\wangxueying\project\CytoBulk\case\he_image\svs\out\demo",
                                            project = "demo",
                                            lr_data = lr_data,
                                            sc_adata = sc_adata,
                                            annotation_key="he_cell_type",
                                            k_neighbor=30,
                                            alpha=0.5,
                                            mapping_sc=True)

File already exists: c:\Users\wangxueying\anaconda\envs\cytobulk\lib\site-packages\cytobulk\tools\model\pretrained_models\DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth
print(f'{out_dir}/combinded_cent.txt already exists, skipping prediction.')
preprocessing of single cell data
Common cell types: {'Lymphocytes', 'Connective Tissue', 'Epithelial Cells', 'Plasma Cells', 'Neutrophils'}
loading graph for H&E image...
loading graph for single cell data with LR affinity...
sample single cells according to predicted label
compute LR affinity
compute cost matrix
optimal transport...
Time to finish mapping: 201.66 seconds
=========================================================================================================================================
build matching file...

Here, df and filtered_adata.obs store the single-cell IDs and their assigned coordinates.

filtered_adata.obs.head(5)

	location	cell	x	y	he_cell_type	cell_type
cell_id
RU426B_164640655824627	0	2453	17	1	Epithelial Cells	Epithelial Cells
RU1229A_Frozen_228044332743404	1	1512	33	2	Epithelial Cells	Epithelial Cells
RU1066_129565685888750	2	2273	129	4	Epithelial Cells	Epithelial Cells
RU426B_166407915399029	3	1954	34	12	Epithelial Cells	Epithelial Cells
RU1066_126293994555309	4	1697	218	12	Epithelial Cells	Epithelial Cells