Validation & Design Rationale¶

Evidence for design decisions and validation data for the SpatialCore cell typing pipeline.

This page documents why the pipeline works the way it does, with actual data from validation experiments.

Why Custom Models?¶

SpatialCore trains custom CellTypist models instead of using pre-trained ones. This is not about improving CellTypist, it's about solving a practical engineering problem.

The Gene Overlap Problem:

Pre-trained CellTypist models were trained on full scRNA-seq transcriptomes (~15,000 genes). Spatial panels can contain as low as 400 genes.

Measured gene overlap (Xenium Human Multi-Tissue Panel, 377 genes):

Pre-trained Model	Training Genes	Overlap	% Utilized
Immune_All_High.pkl	15,789	31	8.2%
Immune_All_Low.pkl	15,789	31	8.2%
Human_Lung_Atlas.pkl	15,203	29	7.7%
Adult_Human_Skin.pkl	14,987	28	7.4%

SpatialCore custom model:

Custom Model	Training Genes	Overlap	% Utilized
lung_custom_v1.pkl	377	377	100%

The pre-trained models ignore 91–93% of their learned features when applied to spatial data. Every coefficient they learned for genes outside the panel becomes meaningless.

What This Means for Predictions:

When 92% of features are missing, the model is essentially guessing. The decision scores shift systematically negative, causing:

Crushed confidence scores — raw probabilities < 0.1 even for correct predictions
Reduced discrimination — less separation between correct and incorrect calls
Unstable rankings — small changes in the 8% overlap can flip predictions

Custom models trained on the exact panel genes produce decision scores centered around the natural boundary, with proper separation between classes.

Reference Data Selection¶

Query vs Collections API:

CellxGene Census offers two access methods. We recommend Query for cell typing training:

Aspect	Collections API	Query API
Access pattern	Download entire datasets	Filter by tissue/disease/cell type
Gene format	Mixed (Ensembl, HUGO, hybrid)	Consistent Ensembl IDs
Cell type labels	Dataset-dependent	Standardized CL IDs (when available)
Source diversity	Single study per download	Cross-study aggregation
Recommended for	Reproducibility of specific studies	Training custom models

Query API validation (lung tissue, 2024-01-18):

Total cells: 847,291
Unique sources: 23
Cell types with CL IDs: 89%
Gene format: 100% Ensembl

Cell Type Diversity vs Sample Size:

We tested how cell type diversity scales with max_cells to establish sampling recommendations:

max_cells	actual_cells	cell_types	singletons	<10 cells (filtered)	≥10 cells (kept)
20,000	20,000	898	16	783	115
40,000	40,000	898	13	765	133
80,000	80,000	898	15	745	153
100,000	100,000	898	14	740	158

Key findings:

Singletons are stable (~14-16 regardless of sample size) — these are annotation artifacts, not sampling effects
82-87% of cell types are filtered by min_cells_per_type=10 — most CellxGene labels are too rare for classifier training
Doubling cells → +15-18% usable cell types — diminishing returns above 100K cells
Recommendation: max_cells=100000 balances diversity (158 types) with memory (~2GB per reference)

Label Quality Validation:

We analyzed label quality in reference data to establish filtering thresholds.

Singleton analysis (types with < 10 cells):

Query: tissue="lung", 100K cells sampled

Cell types with < 10 cells: 47
Examples:
  - "CD4-positive, alpha-beta cytotoxic T cell": 3 cells
  - "conventional dendritic cell type 3": 2 cells
  - "pulmonary ionocyte": 1 cell

These singletons cause training instability and should be filtered.

Recommended filtering parameters:

combine_references(
    ...,
    min_cells_per_type=10,     # Remove singletons
    filter_min_cells=True,      # Apply the filter
    exclude_labels=DEFAULT_EXCLUDE_LABELS,  # Remove "unknown", "doublet", etc.
)

Source-Aware Balancing¶

The Problem:

When combining multiple reference datasets, larger atlases dominate training.

Example scenario (lung tissue):

Tissue Atlas: 27,000 cells (natural proportions)
  - Macrophage: 8,100 (30%)
  - Alveolar macrophage: 4,050 (15%)
  - Type II pneumocyte: 5,400 (20%)
  - Type I pneumocyte: 2,700 (10%)
  - Fibroblast: 4,050 (15%)
  - Epithelial cell: 2,700 (10%)

FACS Lymphoid: 8,300 cells (sorted populations)
  - CD4+ T cell: 3,000 (36%)
  - CD8+ T cell: 2,000 (24%)
  - NK cell: 1,500 (18%)
  - B cell: 1,000 (12%)
  - Plasma cell: 500 (6%)
  - Macrophage: 300 (4%)

NAIVE CONCATENATION for Macrophage:
  8,400 total (96% from Tissue Atlas, 4% from FACS)
  Problem: Model learns Tissue Atlas batch effects, ignores FACS diversity

Cap & Fill Algorithm:

SpatialCore implements source-aware "Cap & Fill" balancing:

FOR each cell_type:
  1. Calculate per-source proportions
  2. Allocate target based on source_balance mode
  3. Cap at available cells per source
  4. Fill shortfall from sources with capacity
  5. Sample without replacement

Validation test:

# Test fixture: Macrophage appears in BOTH sources
# Tissue Atlas: 8,100 macrophages
# FACS Lymphoid: 300 macrophages

# PROPORTIONAL balance (target=2000 per type)
result = subsample_balanced(
    adata,
    label_column="cell_type",
    source_column="reference_source",
    source_balance="proportional",
    max_cells_per_type=2000,
)

# Validation (Macrophage allocation):
  Tissue Atlas: 2000 × (8100/8400) = 1,929 cells (96.4%)
  FACS Lymphoid: 2000 × (300/8400) = 71 cells (3.6%)
  Total: 2,000 ✓

# EQUAL balance for FACS data:
result = subsample_balanced(
    adata,
    source_balance="equal",
    max_cells_per_type=2000,
)

# Validation (Macrophage allocation with equal balance):
  Tissue Atlas: 1,000 cells (50%)
  FACS Lymphoid: 300 cells (all available, fills from other source)
  Total: 1,300 ✓  (FACS capped at available, backfilled)

Semantic Grouping Validation:

Different references use different names for the same cell type. Grouping by CL ID ensures proper balancing.

Test scenario:

# Reference A labels: "CD4-positive, alpha-beta T cell"
# Reference B labels: "CD4+ T cells"
# Both map to: CL:0000624

# WITHOUT group_by_column (text labels):
# These are treated as DIFFERENT types → incorrect balancing

# WITH group_by_column="cell_type_ontology_term_id":
# Both grouped under CL:0000624 → correct balancing

Validation results:

Scenario	Cell Type	Source A	Source B	Total
Without grouping	"CD4-positive, alpha-beta T cell"	2000	0	2000
Without grouping	"CD4+ T cells"	0	1000	1000
With grouping	CL:0000624	1667	333	2000

Enriched Reference Handling¶

The Problem:

FACS-sorted or enriched references contain artificially high proportions of specific cell types.

Example:

Tissue atlas (20K cells):
  - T cells: 3,000 (15%)
  - Macrophages: 8,000 (40%)
  - Epithelial: 9,000 (45%)
  - NK cells: 0 (absent from tissue sample)

FACS-sorted NK reference (5K cells):
  - NK cells: 5,000 (100%)  ← pure enriched population

Combined (naive):
  - NK cells: 5,000 / 25,000 = 20% of training
  - Biological reality: let's say NK should be ~0.25% in lung tissue
  - This is 80× the biological frequency!

target_proportions Solution:

The target_proportions parameter caps enriched cell types at expected biological frequencies:

balanced = subsample_balanced(
    combined,
    label_column="cell_type",
    max_cells_per_type=5000,
    target_proportions={
        "NK cell": 0.0025,      # 0.25% biological frequency
        "plasma cell": 0.005,   # 0.5% biological frequency
    },
)

Users can provide imperically determind target_proportions in .csv or .json formats to spatialcore for target proportion matching.

Validation:

Input: 25,000 cells combined (20K tissue + 5K FACS pure NK)
  - NK cells: 5,000 (from FACS reference only)
  - Target proportion: 0.25% = 0.0025
  - Expected: 0.0025 × 25,000 = 62-63 cells

Output with target_proportions:
  - NK cells: 62 cells ✓ (reduced from 5,000!)
  - T cells, Macrophages, Epithelial: capped at max_cells_per_type as usual

Where to get biological proportions:

Source	Use Case	Example
Literature	Known tissue composition	"NK cells are x% of lung tissue"
Flow cytometry	Gold standard for immune	FACS panel quantification
Pilot scRNA-seq	Same tissue, unenriched	Large atlas cell type frequencies
Expert knowledge	Domain expertise	Pulmonologist/pathologist input

Confidence Calibration¶

Why Z-Score Transformation?

CellTypist outputs logistic regression decision scores, transformed to probabilities via sigmoid:

probability = sigmoid(decision_score) = 1 / (1 + exp(-decision_score))

The problem: When applied to spatial data with low gene overlap, decision scores shift systematically negative:

scRNA-seq training distribution:
  decision scores: mean ≈ 0, range [-3, +3]
  probabilities: centered around 0.5

Spatial inference distribution:
  decision scores: mean ≈ -5, range [-8, -2]
  probabilities: all < 0.1

Example:
  decision_score = -4.0
  sigmoid(-4.0) = 0.018 (1.8% "confident")
  But this cell is correctly classified!

Z-Score Solution:

SpatialCore z-normalizes decision scores within the spatial dataset:

z_score = (decision_score - mean(all_scores)) / std(all_scores)
confidence = sigmoid(z_score)

Interpretation: - confidence > 0.5 → above-average score for this dataset - confidence > 0.8 → well above average (recommended threshold)

Validation data:

Metric	Raw Probability	Z-Score Transformed
Mean confidence (all cells)	0.12	0.50
Mean confidence (assigned)	0.18	0.73
Confidence range	[0.001, 0.45]	[0.05, 0.99]
Interpretability	Low (what does 0.12 mean?)	High (above/below average)

Ontology Mapping Validation¶

4-Tier Matching Performance:

We validated the ontology matching system on 500+ unique cell type labels from CellxGene Census:

Tier	Strategy	Labels Matched	%
0	Pattern canonicalization	287	57.4%
1	Exact match	156	31.2%
2	Token-based	38	7.6%
3	Word overlap	12	2.4%
—	Unmapped	7	1.4%

Total coverage: 98.6%

Pattern Matching Examples:

The pattern canonicalization (Tier 0) handles common variations:

Input Label	Canonical Form	CL ID
"CD4+ T cells"	"cd4-positive, alpha-beta t cell"	CL:0000624
"Macrophages"	"macrophage"	CL:0000235
"NK cells"	"natural killer cell"	CL:0000623
"Tregs"	"regulatory t cell"	CL:0000815
"DCs"	"dendritic cell"	CL:0000451
"Club (nasal)"	"club cell"	CL:0000158

Unmapped Label Analysis:

The 1.4% unmapped labels typically fall into these categories:

Category	Example	Reason
Novel subtypes	"CD8+ tissue-resident memory T cell subset 3"	Too specific for CL
Ambiguous	"other"	Not a cell type
Typos	"macrophae"	Misspelling
Custom annotations	"Cluster_12"	Dataset-specific

Quality Metrics¶

We recommend evaluating annotation quality using:

Metric	Description	Good Value
Marker CV	Coefficient of variation for canonical markers within cell types	< 0.5
DEG detection	Number of significant DEGs per cell type	> 50
DEG specificity	DEGs specific to each type vs shared	> 80% unique
Confidence distribution	Shape of confidence scores	Bimodal (high/low)

Note: % unassigned alone is not a quality metric—it depends on confidence threshold and biological heterogeneity.

Packaged Data Files¶

ontology_index.json¶

Location: src/spatialcore/data/ontology_mappings/ontology_index.json

{
  "metadata": {
    "cl_terms": 15963,
    "version": "2024-01-15"
  },
  "cl": {
    "b cell": {"id": "CL:0000236", "name": "B cell"},
    "t cell": {"id": "CL:0000084", "name": "T cell"},
    "macrophage": {"id": "CL:0000235", "name": "macrophage"},
    ...
  }
}

Usage:

from spatialcore.annotation import load_ontology_index

index = load_ontology_index()
print(f"Total terms: {index['metadata']['cl_terms']}")
print(index['cl']['macrophage'])
# {'id': 'CL:0000235', 'name': 'macrophage'}

canonical_markers.json¶

Location: src/spatialcore/data/markers/canonical_markers.json

{
  "markers": {
    "macrophage": ["CD163", "CD68", "MARCO", "CSF1R", "MERTK", "C1QA", "C1QB", "C1QC", "MRC1"],
    "t cell": ["CD3D", "CD3G", "CD3E", "IL7R", "TRBC1"],
    "b cell": ["CD19", "MS4A1", "CD79A", "CD79B", "IGHM"],
    "fibroblast": ["COL1A1", "DCN", "PDGFRA", "VIM", "LUM"],
    "endothelial cell": ["PECAM1", "VWF", "CDH5", "ERG", "FLT1"],
    ...
  }
}

Usage:

from spatialcore.annotation import load_canonical_markers

markers = load_canonical_markers()
print(markers["macrophage"])
# ['CD163', 'CD68', 'MARCO', 'CSF1R', 'MERTK', 'C1QA', 'C1QB', 'C1QC', 'MRC1']

ensembl_to_hugo_human.tsv¶

Location: src/spatialcore/data/gene_mappings/ensembl_to_hugo_human.tsv

ensembl_gene_id    hgnc_symbol
ENSG00000121410    A1BG
ENSG00000268895    A1BG-AS1
ENSG00000148584    A1CF
...

Usage:

from spatialcore.core.utils import load_ensembl_to_hugo_mapping

mapping = load_ensembl_to_hugo_mapping()
print(mapping["ENSG00000121410"])
# "A1BG"

References¶

CellTypist: Domínguez Conde et al., 2022. Cross-tissue immune cell analysis reveals tissue-specific features in humans
Cell Ontology: Diehl et al., 2016. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability
CellxGene Census: CZI Single-Cell Biology. cellxgene.cziscience.com