Benchmark: Custom vs Pre-trained Cell Type Annotation
This benchmark compares two cell typing approaches on spatial transcriptomics data: out-of-the-box CellTypist with a pre-trained model versus SpatialCore's custom training pipeline. The results show why panel-specific training matters for spatial data.
| Aspect | Standard CellTypist | SpatialCore Pipeline |
|---|---|---|
| Model type | Pre-trained (Immune_All, etc.) | Custom (panel-specific) |
| Gene overlap | ~5-9% on 400-gene panels | 100% |
| Confidence metric | Raw sigmoid probability | Z-score transformed |
| Threshold meaning | "Model >50% likely" | "Above average for this dataset" |
| Ontology mapping | Model-dependent labels | Cell Ontology (CL) IDs |
| Multi-reference handling | N/A | Source-aware balancing |
The Problem:
Pre-trained classifiers face two challenges on spatial data:
- Gene overlap mismatch - Pre-trained models learn from RNA-seq (~20,000 genes), but spatial panels contain only 300-1,000 targeted genes. At inference time, 90-95% of learned features are missing.
- Reference bias - Single-source training data introduces batch effects and over-represents common cell types at the expense of rare populations.
SpatialCore addresses both:
- Panel-specific training - Train on exactly the genes in your spatial panel (100% overlap)
- CellxGene integration - Download tissue-matched references from 60M+ cells
- Source-aware balancing - subsample_balanced() ensures fair representation across sources
Dataset: 10x Genomics Xenium | Human lung (NSCLC) | 93,162 cells | 518 panel genes
Step 1: Acquiring Reference Data
The first step is obtaining tissue-matched scRNA-seq references. SpatialCore integrates with CZ CELLxGENE Discover Census, providing access to 60M+ cells with standardized Cell Ontology labels.
from spatialcore.annotation import acquire_reference
from pathlib import Path
REFERENCE_DIR = Path("references/cellxgene/lung")
REFERENCE_DIR.mkdir(parents=True, exist_ok=True)
# Download healthy lung tissue (~100k cells)
acquire_reference(
    source="cellxgene://?tissue=lung&disease=normal",
    output=REFERENCE_DIR / "healthy_lung.h5ad",
    max_cells=100000,
)
# Download NSCLC tumor samples (~100k cells)
acquire_reference(
    source="cellxgene://?tissue=lung&disease=non-small cell lung carcinoma",
    output=REFERENCE_DIR / "nsclc.h5ad",
    max_cells=100000,
)
Why CellxGene?
- 60M+ cells from 700+ curated datasets
- Standardized Cell Ontology (CL) labels
- Tissue and disease filtering
- Programmatic access via Census API
For validation of the CellxGene download and subsampling approach, see validation.md.
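To make the `cellxgene://` source strings more concrete, the sketch below parses the query parameters and builds a Census `obs_value_filter`. The helper names `build_obs_filter` and `fetch_reference`, and the mapping of `tissue` to the Census `tissue_general` column, are assumptions for illustration and may not match what `acquire_reference()` does internally. `cellxgene_census.open_soma()` and `cellxgene_census.get_anndata()` are the public Census calls. The sketch does not implement the `max_cells` cap.

```python
from urllib.parse import parse_qs, urlsplit


def build_obs_filter(source: str, primary_only: bool = True) -> str:
    """Turn a cellxgene:// source string into a Census obs_value_filter.

    Assumption: 'tissue' maps to the Census 'tissue_general' column and
    'disease' maps to 'disease'. SpatialCore may resolve these differently.
    """
    params = parse_qs(urlsplit(source).query)
    clauses = []
    if "tissue" in params:
        clauses.append(f"tissue_general == '{params['tissue'][0]}'")
    if "disease" in params:
        clauses.append(f"disease == '{params['disease'][0]}'")
    if primary_only:
        # Skip cells that are duplicated across datasets
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


def fetch_reference(source: str):
    """Download matching human cells as AnnData (needs network access)."""
    import cellxgene_census  # imported lazily so the filter helper works without it

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_value_filter=build_obs_filter(source),
        )


print(build_obs_filter("cellxgene://?tissue=lung&disease=normal"))
```

Only `build_obs_filter()` runs offline; `fetch_reference()` can pull a large number of cells, so a real pipeline would subsample before writing the `.h5ad` file.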
Step 2: The Baseline (Standalone CellTypist)
We establish a baseline using out-of-the-box CellTypist with the pre-trained Human Lung Atlas model. This represents the typical user experience when applying pre-trained models to spatial data.
import scanpy as sc
import celltypist
from celltypist import models
# Load spatial data
adata = sc.read_h5ad("xenium_lung_cancer_clustered.h5ad")
# Download and load pre-trained model
models.download_models(model="Human_Lung_Atlas.pkl")
model = models.Model.load(model="Human_Lung_Atlas.pkl")
# Check gene overlap
model_genes = set(model.features)
query_genes = set(adata.var_names)
overlap = model_genes & query_genes
overlap_pct = 100 * len(overlap) / len(model_genes)
print(f"Model genes: {len(model_genes):,}")
print(f"Query genes: {len(query_genes):,}")
print(f"Overlap: {len(overlap):,} ({overlap_pct:.1f}%)")
# Output: Overlap: 356 (7.1%)
# Run annotation
predictions = celltypist.annotate(adata, model=model, majority_voting=False)
# Check confidence
confidence = predictions.probability_matrix.max(axis=1).values
low_conf = (confidence < 0.5).mean()
print(f"Below 0.5 threshold: {low_conf:.1%}")
# Output: Below 0.5 threshold: 98.0%
Result: 98% of cells fall below the 0.5 confidence threshold.
With only 7% gene overlap, the model cannot make confident predictions. The missing 93% of features contain critical discriminative information the classifier learned during training.
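A toy calculation can illustrate why missing features push probabilities toward zero. It assumes a one-vs-rest logistic classifier in which the contribution of absent genes is simply dropped from the decision score; the weights, intercept, and expression values are invented for illustration and are not taken from the Human Lung Atlas model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a decision score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy classifier for one cell type: 100 marker genes with equal weights
# and a negative intercept (all values are hypothetical).
n_features = 100
weight = 0.1
intercept = -5.0
expression = 1.0  # scaled expression of every marker in a true-positive cell

# Decision score when every learned feature is measured
full_score = intercept + n_features * weight * expression

# Decision score when only 7 of 100 features are on the panel (~7% overlap)
n_present = 7
partial_score = intercept + n_present * weight * expression

p_full = sigmoid(full_score)
p_partial = sigmoid(partial_score)
print(f"All features:   p = {p_full:.3f}")
print(f"7% of features: p = {p_partial:.3f}")
```

In this toy setting the same cell drops from a confident call to a probability far below 0.5, because the intercept was fitted on the assumption that all markers would contribute.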
Step 3: The SpatialCore Solution
SpatialCore solves both problems - gene overlap and reference bias - in a single API call. The train_and_annotate() function trains a custom CellTypist model on your exact panel genes, then applies it with z-score confidence normalization.
The Full Pipeline:
from spatialcore.annotation import train_and_annotate, discover_training_data
import scanpy as sc
# Load spatial data
adata = sc.read_h5ad("xenium_lung_cancer_clustered.h5ad")
# Discover available references
datasets = discover_training_data("references/cellxgene/lung")
reference_paths = [ds.path for ds in datasets]
# Train custom model and annotate
adata = train_and_annotate(
    adata,
    references=reference_paths,
    label_columns=["cell_type"] * len(reference_paths),
    tissue="lung",
    balance_strategy="proportional",
    max_cells_per_type=5000,
    max_cells_per_ref=100000,
    confidence_threshold=0.8,
    model_output="models/lung_nsclc_custom_v1.pkl",
    plot_output="plots/",
    add_ontology=True,
    generate_plots=True,
)
# Check results
print(f"Cell types: {adata.obs['cell_type'].nunique()}")
print(f"Mean confidence: {adata.obs['cell_type_confidence'].mean():.3f}")
print(f"Unassigned: {(adata.obs['cell_type'] == 'Unassigned').mean():.2%}")
# Output: Unassigned: 0.03%
What train_and_annotate() Does:
The pipeline executes 9 stages:
- Extract panel genes - Gets gene names from spatial data
- Load references - Combines multiple h5ad files with Ensembl-to-HUGO normalization
- Fill ontology IDs - Maps cell type labels to Cell Ontology (CL) terms
- Source-aware balancing - subsample_balanced() with "Cap & Fill" strategy
- Train CellTypist model - SGD classifier on balanced, panel-subset data
- Annotate spatial data - Apply model with z-score confidence
- Apply threshold - Mark low-confidence cells as Unassigned
- Map to ontology - Add CL IDs to predictions
- Generate plots - DEG heatmap, 2D validation, confidence plots
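The panel-subsetting idea behind the first stages (extract panel genes, then train on panel-subset data) can be sketched in plain Python. The function `subset_to_panel` and the tiny matrix are hypothetical and only illustrate the column selection; SpatialCore operates on AnnData objects and its internals may differ.

```python
def subset_to_panel(ref_genes, ref_matrix, panel_genes):
    """Keep only reference columns whose gene is on the spatial panel.

    ref_matrix is a list of rows (cells), one value per gene in ref_genes.
    Returns (shared_genes, subset_matrix) with genes in panel order.
    """
    column_of = {gene: i for i, gene in enumerate(ref_genes)}
    shared = [gene for gene in panel_genes if gene in column_of]
    columns = [column_of[gene] for gene in shared]
    subset = [[row[c] for c in columns] for row in ref_matrix]
    return shared, subset


# Hypothetical scRNA-seq reference: 2 cells x 5 genes
ref_genes = ["CD3E", "EPCAM", "ACTB", "MS4A1", "GAPDH"]
ref_matrix = [
    [5, 0, 9, 0, 7],  # T cell
    [0, 6, 8, 0, 9],  # epithelial cell
]
# Hypothetical 3-gene spatial panel
panel_genes = ["EPCAM", "CD3E", "MS4A1"]

shared, subset = subset_to_panel(ref_genes, ref_matrix, panel_genes)
print(shared)
print(subset)
```

Because the classifier only ever sees the shared columns, every feature it learns is guaranteed to be present in the spatial data at inference time.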
Source-Aware Balancing:
When combining multiple references, some datasets may dominate others. The subsample_balanced() function prevents this through "Cap & Fill" balancing:
from spatialcore.annotation import subsample_balanced
# Balance training data across sources and cell types
balanced = subsample_balanced(
    combined_references,
    label_column="cell_type_ontology_label",
    group_by_column="cell_type_ontology_term_id",  # Group by CL ID
    source_column="reference_source",
    source_balance="proportional",
    max_cells_per_type=5000,
    copy=True,
)
Why this matters:
- Source balance - Each reference contributes proportionally to each cell type
- CL ID grouping - Semantic synonyms (e.g., "CD4+ T cell" and "CD4-positive, alpha-beta T cell") are grouped together
- Cell type balance - Rare types get adequate representation
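One possible reading of "Cap & Fill" is sketched below: cells are grouped by CL ID, each source receives a quota under the per-type cap, and any unused quota is handed to sources that still have spare cells. This is an interpretation for illustration only. The function `cap_and_fill`, its `"equal"` mode, and the example counts are assumptions, and the actual `subsample_balanced()` logic may differ.

```python
import random
from collections import Counter, defaultdict


def cap_and_fill(cells, max_cells_per_type, source_balance="proportional", seed=0):
    """Subsample (cell_id, cl_id, source) tuples to at most max_cells_per_type per CL ID."""
    rng = random.Random(seed)
    by_type = defaultdict(lambda: defaultdict(list))
    for cell in cells:
        _, cl_id, source = cell
        by_type[cl_id][source].append(cell)

    selected = []
    for by_source in by_type.values():
        total = sum(len(group) for group in by_source.values())
        if total <= max_cells_per_type:
            # Rare type: keep every cell
            for group in by_source.values():
                selected.extend(group)
            continue

        # Cap: integer quota per source
        quotas = {}
        for source, group in by_source.items():
            if source_balance == "proportional":
                quota = max_cells_per_type * len(group) // total
            else:  # "equal" split across sources
                quota = max_cells_per_type // len(by_source)
            quotas[source] = min(quota, len(group))

        # Fill: hand unused slots to the sources with the most spare cells
        remaining = max_cells_per_type - sum(quotas.values())
        spare_first = sorted(
            by_source, key=lambda s: len(by_source[s]) - quotas[s], reverse=True
        )
        for source in spare_first:
            extra = min(remaining, len(by_source[source]) - quotas[source])
            quotas[source] += extra
            remaining -= extra

        for source, group in by_source.items():
            selected.extend(rng.sample(group, quotas[source]))
    return selected


# Hypothetical input: one abundant type split 900/100 across two sources, one rare type
cells = (
    [(f"a{i}", "CL:0000624", "ref1") for i in range(900)]
    + [(f"b{i}", "CL:0000624", "ref2") for i in range(100)]
    + [(f"c{i}", "CL:0000084", "ref2") for i in range(30)]
)
proportional = Counter((cl, src) for _, cl, src in cap_and_fill(cells, 500))
equal = Counter((cl, src) for _, cl, src in cap_and_fill(cells, 500, source_balance="equal"))
print(proportional)
print(equal)
```

Grouping on the CL ID rather than the free-text label is what lets synonymous labels from different references share one quota.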
For detailed scenarios and validation, see validation.md.
For the full API reference, see api.md.
Results
We evaluated both methods across seven metrics measuring annotation quality. All biological metrics (CV, fold change, purity, contamination) were calculated on all cells without threshold filtering, ensuring a fair comparison.
| Metric | Standalone | SpatialCore | Improvement |
|---|---|---|---|
| Gene Overlap (%) | 7.1% | 100% | 14x |
| Unassigned Cells (%) | 98.0% | 0.03% | 3,800x |
| Marker CV | 1.77 | 1.23 | 30% lower |
| Marker log2FC | 1.50 | 2.17 | 45% higher |
| DEG log2FC | 3.93 | 4.96 | 26% higher |
| Marker Purity (%) | 39.0% | 51.7% | 33% higher |
| Contamination | 0.85 | 0.86 | ~equal |
SpatialCore wins on 6 of 7 metrics, with contamination approximately equal.
T Cell Subtype Collapsing
SpatialCore collapses granular T cell subtypes (e.g., "effector memory CD8-positive, alpha-beta T cell", "central memory CD4-positive, alpha-beta T cell") into their parent categories ("CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"). This is intentional: spatial transcriptomics panels typically lack the transcriptional resolution to discriminate between memory, effector, and naive T cell states. Collapsing these subtypes improves marker consistency (CV, purity) at the cost of slightly higher cross-type contamination, as related T cell populations now share canonical markers. For applications requiring granular T cell subtyping, consider targeted panels with T cell-specific markers or orthogonal validation (e.g., protein markers via immunofluorescence).
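The collapsing step amounts to a child-to-parent label lookup. The mapping below contains only the two examples named above and is a hypothetical illustration; the names `PARENT_LABEL` and `collapse_label` are not part of the SpatialCore API.

```python
# Hypothetical child-to-parent map built from the examples in the text.
# A real mapping would likely be derived from the Cell Ontology hierarchy.
PARENT_LABEL = {
    "effector memory CD8-positive, alpha-beta T cell": "CD8-positive, alpha-beta T cell",
    "central memory CD4-positive, alpha-beta T cell": "CD4-positive, alpha-beta T cell",
}


def collapse_label(label: str) -> str:
    """Return the parent category for a granular subtype, else the label unchanged."""
    return PARENT_LABEL.get(label, label)


labels = [
    "effector memory CD8-positive, alpha-beta T cell",
    "central memory CD4-positive, alpha-beta T cell",
    "B cell",
]
collapsed = [collapse_label(label) for label in labels]
print(collapsed)
```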
Gene Overlap: The Human Lung Atlas model contains 5,017 features learned from RNA-seq. Our Xenium panel has 518 genes. The intersection is only 356 genes (7.1% of the model's features). SpatialCore trains directly on the panel genes, achieving 100% overlap by construction.
Unassigned Rate: The practical consequence of low gene overlap is a high unassigned rate. Standalone CellTypist marks 98% of cells as unassigned (below 0.5 confidence). SpatialCore, with full gene overlap and z-score normalization, marks only 0.03% as unassigned - even with a stricter 0.8 threshold.
Confidence Distribution: The confidence distributions reveal why different thresholds are appropriate. Standalone CellTypist produces raw sigmoid probabilities that cluster near zero when features are missing. SpatialCore's z-score transformation normalizes confidence relative to the dataset, producing an interpretable distribution.
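The exact transformation is not spelled out here, so the following is only one plausible sketch: z-score each cell's winning decision score against all decision scores in the dataset, then squash the result into (0, 1). The function `zscore_confidence`, the logistic squashing step, and the example scores are assumptions, not SpatialCore's documented behavior.

```python
import math
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def zscore_confidence(decision_matrix):
    """Per-cell confidence relative to the dataset.

    decision_matrix: list of rows (cells), one raw decision score per cell type.
    Assumption: the winning score is z-scored against every score in the
    matrix, then passed through a logistic function.
    """
    flat = [score for row in decision_matrix for score in row]
    mu, sigma = mean(flat), pstdev(flat)
    confidences = []
    for row in decision_matrix:
        z = (max(row) - mu) / sigma if sigma > 0 else 0.0
        confidences.append(sigmoid(z))
    return confidences


# Hypothetical decision scores: two cells with a clear winner, one ambiguous cell
decision_matrix = [
    [-2.0, -9.0, -8.0, -9.5],
    [-8.5, -1.5, -9.0, -8.0],
    [-7.0, -7.5, -6.8, -7.2],
]

raw = [sigmoid(max(row)) for row in decision_matrix]  # raw sigmoid probability
conf = zscore_confidence(decision_matrix)
calls = ["assigned" if c >= 0.8 else "Unassigned" for c in conf]
print([round(p, 3) for p in raw])
print([round(c, 3) for c in conf])
print(calls)
```

In this example every raw probability sits below 0.5, while the dataset-relative score separates the two clear calls from the ambiguous one even at a 0.8 threshold.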
Biological Validation: Beyond confidence, we evaluate whether predictions align with known biology using canonical markers. Lower CV indicates more consistent marker expression within predicted populations; higher fold change indicates better marker specificity; higher purity indicates more cells expressing expected markers; lower contamination indicates cleaner boundaries between cell types.
[Figures: Marker CV (lower is better); Marker log2FC (higher is better); Canonical Marker Recovery; DEG Effect Size; Marker Purity (higher is better); Contamination (lower is better)]
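The marker metrics described above can be sketched with textbook definitions. The formulas (population CV, pseudocount-stabilized log2 fold change, fraction of cells with nonzero expression) and the example counts are assumptions; the benchmark's exact definitions, including its contamination score, are not reproduced here.

```python
import math
from statistics import mean, pstdev


def marker_cv(values):
    """Coefficient of variation of a marker within one predicted population."""
    return pstdev(values) / mean(values)


def marker_log2fc(inside, outside, pseudocount=1.0):
    """log2 fold change of mean expression inside vs outside the population."""
    return math.log2((mean(inside) + pseudocount) / (mean(outside) + pseudocount))


def marker_purity(values, min_expr=0):
    """Fraction of cells in the population expressing the marker above min_expr."""
    return sum(v > min_expr for v in values) / len(values)


# Hypothetical CD3E counts in predicted T cells and in all other cells
cd3e_in_t_cells = [4, 5, 3, 0, 4]
cd3e_elsewhere = [0, 1, 0, 0, 0]

cv = marker_cv(cd3e_in_t_cells)
log2fc = marker_log2fc(cd3e_in_t_cells, cd3e_elsewhere)
purity = marker_purity(cd3e_in_t_cells)
print(f"CV={cv:.2f}  log2FC={log2fc:.2f}  purity={purity:.0%}")
```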
Validation Plots
We use SpatialCore's generate_annotation_plots() to render the same validation suite for both outputs, enabling direct visual comparison.
SpatialCore:
[Figures: DEG Heatmap; 2D Marker Validation; Confidence]
Standalone CellTypist:
[Figures: DEG Heatmap; 2D Marker Validation; Confidence]
Conclusion
The gene overlap problem is the primary barrier to applying pre-trained classifiers on spatial data. When 93% of a model's learned features are absent, predictions become unreliable - as demonstrated by the 98% unassigned rate with standalone CellTypist.
SpatialCore addresses this through three complementary innovations: CellxGene integration for acquiring tissue-matched references, source-aware balancing for fair cell type representation, and panel-specific training for 100% gene overlap. Together, these reduce the unassigned rate to 0.03% while improving biological coherence across validation metrics.
For spatial transcriptomics cell typing, this benchmark shows a custom model trained on the panel genes outperforming the pre-trained alternative.
References
Spatial Data
- 10x Genomics (2023). FFPE Human Lung Cancer with Immuno-Oncology Panel. 10xgenomics.com/datasets
CellTypist
- Dominguez Conde C, et al. (2022). Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. DOI: 10.1126/science.abl5197
- GitHub: github.com/Teichlab/celltypist | License: Apache 2.0
CellxGene Census
- CZI Single-Cell Biology, et al. (2023). CZ CELLxGENE Discover. bioRxiv. DOI: 10.1101/2023.10.30.563174
- Docs: chanzuckerberg.github.io/cellxgene-census | License: CC-BY 4.0















