Integration with single cell, spatial and in situ analysis

If using an external R installation (may not be necessary on Linux systems).

import os
os.environ['R_HOME'] = r'C:\Program Files\R\R-4.4.1'

import the packages

import os
import cytobulk as ct
import scanpy as sc
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Here we use the single-cell, spatial, and in situ data from Sarah E. B. Taylor et al. and utilize CytoBulk for integrating the datasets. [Download sc data] [Download st data]

Load ST and sc-RNA seq data

sc_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/10x/input/sc_adata.h5ad")
sc_ori_adata = sc_adata.copy()
st_adata = sc.read_h5ad("C:/Users/wangxueying/project/CytoBulk/case/10x/input/st_adata_sub_3.h5ad")

The cell type information is stored in sc_adata.obs['cell_type]

sc_adata.obs['cell_type'].value_counts()

cell_type
Invasive Tumor             4559
Macrophages 1              2964
CD4+ T Cells               2899
Stromal                    2609
DCIS 2                     2159
DCIS 1                     1862
CD8+ T Cells               1843
B Cells                    1463
Prolif Invasive Tumor      1329
Myoepi ACTA2+              1235
Endothelial                1055
T Cell & Tumor Hybrid      1003
Macrophages 2               760
Myoepi KRT15+               604
Stromal & T Cell Hybrid     426
Perivascular-Like           285
IRF7+ DCs                   210
LAMP3+ DCs                  103
Mast Cells                   92
Name: count, dtype: int64

Deconvolute ST data with sc-RNA seq as the reference.

If you want to use the pretrained model, please download the folder, extract it, and set the path of the extracted folder as the out_dir parameter. [Download] This will help you skip the training steps.

deconv_result,deconv_adata = ct.tl.st_deconv(st_adata = st_adata,
                            sc_adata = sc_adata,
                            annotation_key ="cell_type",
                            out_dir="C:/Users/wangxueying/project/CytoBulk/case/10x/sub3/demo",
                            dataset_name="BRCA_10X",
                            different_source=True,
                            n_cell=10)

The scaled fraction values (where the sum of each row equals 1) are stored in deconv_result and deconv_h5ad.uns['deconv']. The raw data is stored in the out_dir/out_put folder with the file name project_name_prediction_frac.csv.

deconv_result.iloc[:5, :7]

	DCIS 2	Endothelial
AACCACTGCCATAGCC-1	0.172999	0.000000
AACCGCCAGACTACTT-1	0.000000	0.000000
AACGAAGCGTGGAAGT-1	0.000000	0.028330
AACGAATTGACCGGTT-1	0.000000	0.146543
AACGACATTAAGATGG-1	0.000000	0.320166

deconv_adata.uns['deconv'].iloc[:5, :7]

	DCIS 2	Endothelial
AACCACTGCCATAGCC-1	0.172999	0.000000
AACCGCCAGACTACTT-1	0.000000	0.000000
AACGAAGCGTGGAAGT-1	0.000000	0.028330
AACGAATTGACCGGTT-1	0.000000	0.146543
AACGACATTAAGATGG-1	0.000000	0.320166

Spatial Visualization of Deconvolution Results

import matplotlib as mpl
i='Stromal'
plot_adata = deconv_adata.copy()
plot_adata.obs=plot_adata.uns['deconv']
mpl.rcdefaults()
fig=sc.pl.spatial(
    plot_adata,
    color=i,
    img_key=None,
    alpha=1,
    color_map='mako',
    size=1.3,
    title=f'BRCA_10X predicted {i}',
    spot_size=150,
    return_fig=True
)

No description has been provided for this image

Mapping scRNA-seq to ST data

Read the cell count data for each spot(optional).

If the cell count information is not known beforehand, you can use ct.tl.predict_cell_num to segment HE images using Cellpose, or let the program perform an internal estimation. Here we use the cell count data collected from paper of Sarah E. B. Taylor et al.

cell_num = pd.read_csv(r"C:\Users\wangxueying\project\CytoBulk\case\10x\sub3\cell_num_3.csv",index_col=0)
deconv_adata=deconv_adata[cell_num.index,:]
cell_num=cell_num.loc[deconv_adata.obs_names,:]
deconv_adata.obsm['cell_num'] = cell_num
deconv_adata.uns['deconv'] = deconv_adata.uns['deconv'].loc[deconv_adata.obs_names,:]

deconv_adata.obsm['cell_num'].head(5)

	cell_num
CCGTGGCGTCGGTGCT-1	54
TCCGTGTTCTGTTCTG-1	39
TCGTAAGGCGTTCAGG-1	57
CTGCCGATTGCGATGT-1	7
CTTCCGTCGCTGCATC-1	37

Mapping

reconstructed_cell, reconstructed_adata = ct.tl.st_mapping(st_adata = deconv_adata,
                                            sc_adata = sc_ori_adata,
                                            out_dir="C:/Users/wangxueying/project/CytoBulk/case/10x/sub3/demo",
                                            project="BRCA_10X",
                                            annotation_key='cell_type')

=================================================================================================
Start to mapping bulk data with single cell dataset.
Down/up sample of scRNA-seq data according to estimated cell type fractions
Time to down/up sample scRNA-seq data: 0.0 seconds
Time to finish mapping: 316.4 seconds
=========================================================================================================================================

The matching relationship between cells and spots is stored in reconstructed_cell. For single cells mapped to the same spot, we aggregated their gene expression, and the aggregated expression is stored in reconstructed_adata. The original ST expression data is stored in reconstructed_adata.layers['original_st'].

reconstructed_cell.head(5)

	spot_id	cell_id
0	CATTAGGATAGTGAAT-1	CTATTAGGTTCACCAG-1
1	CATTAGGATAGTGAAT-1	CACGAACCAATATGGT-1
2	CATTAGGATAGTGAAT-1	ATCCCAATCAGTAGCG-1
3	CATTAGGATAGTGAAT-1	CCCTGTTGTGGCCCAT-1
4	CATTAGGATAGTGAAT-1	AAAGTAGCAAATGACT-1

Check the pearson correlation between original expression data and reconstructed expression data.

from scipy.stats import pearsonr
import numpy as np
correlations = []
for spot in reconstructed_adata.obs_names:
    if spot in reconstructed_adata.obs_names:
        expr1 = reconstructed_adata[spot, :].X[0]
        expr2 = reconstructed_adata.layers['original_st'][reconstructed_adata.obs_names.get_loc(spot), :]
        #r, p = pearsonr(expr1.X[0], expr2.X[0])
        r, p = pearsonr(expr1, expr2)
        correlations.append((spot, r, p))

correlations_df = pd.DataFrame(correlations, columns=['spot', 'Pearson R', 'pvalue'])
correlations_df['X']=reconstructed_adata.obsm['spatial'][:,0]
correlations_df['Y']=reconstructed_adata.obsm['spatial'][:,1]
correlations_df.set_index('spot',inplace=True)
reconstructed_adata.obs=correlations_df

np.mean(correlations_df['Pearson R'])

0.691908648515156

Visualization of correlation result

import matplotlib.pyplot as plt
fig=sc.pl.spatial(
    reconstructed_adata,
    color='Pearson R',
    img_key=None,
    alpha=0.8,
    size=2,
    title=f'Reconstructed BRCA 10X \nmean Pearson correlation = 0.692\ngene number = 18,082',
    return_fig=True,
    add_outline=False,
    frameon=False,
    spot_size=.8,
    library_id='CytAssist_FFPE_Human_Breast_Cancer',
    color_map="mako_r"
)