PopV population-level cell annotation: 10 algorithms (SCVI, SCANVI, CellTypist, OnClass, RF, SVM, XGBoost, BBKNN, HARMONY, SCANORAMA), consensus voting, pretrained hub models.
PopV (Population Voting) annotates cell types by running up to 10 classification algorithms and aggregating predictions via majority voting. Unlike single-method annotation (SCSA, MetaTiME, CellTypist alone), PopV produces a consensus prediction that is more robust to individual algorithm failures. The module also supports ontology-aware voting via the Cell Ontology (CL) for hierarchical label resolution.
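To make the voting idea concrete, here is a minimal sketch of plain majority voting for a single cell. It is not PopV's implementation: the function name `majority_vote`, the example labels, and the tie-breaking rule (first label seen wins) are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return (label, n_votes) for one cell.

    `predictions` maps a method name to the label that method assigned.
    Ties go to the label that appears first, an arbitrary choice
    made for this sketch.
    """
    counts = Counter(predictions.values())
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes


# Hypothetical per-method calls for a single cell
cell_predictions = {
    "CELLTYPIST": "T cell",
    "Support_Vector": "T cell",
    "XGboost": "NK cell",
}
label, score = majority_vote(cell_predictions)
print(label, score)
```

One dissenting method does not change the outcome here, which is the robustness property described above.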
# Column in ref_adata.obs that holds the reference cell type labels
ref_labels_key = 'cell_type'

# Before PopV: verify reference has the cell type column
assert ref_labels_key in ref_adata.obs.columns, \
    f"ref_adata.obs['{ref_labels_key}'] not found. Available: {list(ref_adata.obs.columns)}"

# Verify no NaN in reference labels
assert ref_adata.obs[ref_labels_key].notna().all(), \
    f"NaN values in ref_adata.obs['{ref_labels_key}']. Use fillna() or drop these cells."

# Verify gene overlap
overlap = query_adata.var_names.intersection(ref_adata.var_names)
assert len(overlap) > 100, \
    f"Only {len(overlap)} overlapping genes between query and reference. Check var_names format (ENSEMBL vs symbol)."
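When the overlap check fails, a common cause is that one dataset uses Ensembl IDs and the other uses gene symbols. The sketch below is one way to detect that mismatch before running PopV. The helper names (`ensembl_fraction`, `same_id_style`), the regular expression, and the 0.5 threshold are assumptions for illustration; they are not part of omicverse or PopV.

```python
import re

# Ensembl gene IDs look like ENSG00000141510 (human) or ENSMUSG00000059552
# (mouse), optionally followed by a version suffix such as ".12".
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def ensembl_fraction(gene_names):
    """Fraction of names that match the Ensembl gene ID pattern."""
    names = list(gene_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_GENE_RE.match(str(name)))
    return hits / len(names)


def same_id_style(query_names, ref_names, threshold=0.5):
    """True when both lists are mostly Ensembl IDs, or both mostly not."""
    query_is_ensembl = ensembl_fraction(query_names) >= threshold
    ref_is_ensembl = ensembl_fraction(ref_names) >= threshold
    return query_is_ensembl == ref_is_ensembl


print(same_id_style(["ENSG00000141510", "ENSG00000010610"], ["TP53", "CD4"]))
```

With AnnData objects, the same helper could be called as `same_id_style(query_adata.var_names, ref_adata.var_names)`.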
import omicverse as ov

# Process_Query preprocesses and concatenates query + reference
process_obj = ov.popv.Process_Query(
    query_adata=query_adata,
    ref_adata=ref_adata,
    ref_labels_key='cell_type',            # REQUIRED: column in ref_adata.obs
    ref_batch_key='batch',                 # batch column in ref_adata.obs
    query_batch_key='batch',               # batch column in query_adata.obs (optional)
    cl_obo_folder=False,                   # False to skip ontology, or path to CL .obo file
    prediction_mode='retrain',             # 'retrain' | 'inference' | 'fast'
    unknown_celltype_label='unknown',      # label for query cells
    n_samples_per_label=300,               # subsample reference per cell type
    hvg=4000,                              # number of highly variable genes
    save_path_trained_models='tmp/',       # where to save models
    pretrained_scvi_path=None,             # path to pretrained scVI model (optional)
)
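The `n_samples_per_label` argument caps how many reference cells are kept per cell type. The sketch below illustrates that idea with the standard library only. It is an assumption about the general technique, not a copy of PopV's internal sampling, and the function name `subsample_per_label` is hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_label(labels, n_per_label, seed=0):
    """Return sorted indices keeping at most `n_per_label` cells per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    kept = []
    for indices in by_label.values():
        # Labels with few cells are kept whole; abundant labels are sampled
        if len(indices) > n_per_label:
            indices = rng.sample(indices, n_per_label)
        kept.extend(indices)
    return sorted(kept)


# Toy reference: 5 B cells and 2 T cells, capped at 3 cells per label
labels = ["B cell"] * 5 + ["T cell"] * 2
kept = subsample_per_label(labels, n_per_label=3)
print(len(kept))
```

Capping abundant cell types this way reduces training time and keeps rare types from being swamped.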
prediction_mode choices:
- 'retrain': Train all models from scratch on reference+query. Most accurate, slowest.
- 'inference': Load previously saved models. Requires save_path_trained_models from prior run.
- 'fast': Skip integration-heavy algorithms. Uses FAST_ALGORITHMS subset.

Preprocessing applied automatically:
- layers['scvi_counts']

# Run all algorithms and compute consensus
ov.popv.annotate_data(
    process_obj.adata,
    methods='all',               # or list of specific algorithms
    save_path='results/popv/',   # saves predictions.csv here
    methods_kwargs=None,         # dict of per-method overrides
)
| Algorithm | Result Key | Type | Speed |
|---|---|---|---|
| KNN_SCVI | popv_knn_on_scvi_prediction | Deep learning + KNN | Medium |
| SCANVI_POPV | popv_scanvi_prediction | Semi-supervised DL | Medium |
| CELLTYPIST | popv_celltypist_prediction | Logistic regression | Fast |
| ONCLASS | popv_onclass_prediction | Ontology-guided | Medium |
| Support_Vector | popv_svm_prediction | SVM | Fast |
| XGboost | popv_xgboost_prediction | Gradient boosting | Fast |
| KNN_HARMONY | popv_knn_harmony_prediction | Harmony + KNN | Fast |
| KNN_BBKNN | popv_knn_bbknn_prediction | BBKNN + KNN | Fast |
| Random_Forest | popv_rf_prediction | Random forest | Fast |
| KNN_SCANORAMA | popv_knn_scanorama_prediction | Scanorama + KNN | Medium |
Algorithm subsets:
- FAST_ALGORITHMS: KNN_SCVI, SCANVI_POPV, Support_Vector, XGboost, ONCLASS, CELLTYPIST (used with prediction_mode='fast')
- CURRENT_ALGORITHMS: All algorithms except Random_Forest and KNN_SCANORAMA, which are outdated
- 'all' or None: Uses CURRENT_ALGORITHMS (or FAST_ALGORITHMS in fast mode)

# Run only fast classical methods
ov.popv.annotate_data(
    process_obj.adata,
    methods=['CELLTYPIST', 'Support_Vector', 'XGboost'],
)
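The way a `methods` argument maps onto the algorithm subsets described above can be sketched as follows. The lists are transcribed from this document rather than imported from the library, and `resolve_methods` is a hypothetical helper, so treat it as an illustration of the selection rules and of the case-sensitive names, not as PopV's own code.

```python
# Algorithm names transcribed from the table in this document
# (not imported from the library)
DOC_ALL_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "CELLTYPIST", "ONCLASS", "Support_Vector",
    "XGboost", "KNN_HARMONY", "KNN_BBKNN", "Random_Forest", "KNN_SCANORAMA",
]
DOC_FAST_ALGORITHMS = [
    "KNN_SCVI", "SCANVI_POPV", "Support_Vector", "XGboost", "ONCLASS", "CELLTYPIST",
]
DOC_CURRENT_ALGORITHMS = [
    name for name in DOC_ALL_ALGORITHMS
    if name not in ("Random_Forest", "KNN_SCANORAMA")
]


def resolve_methods(methods, prediction_mode="retrain"):
    """Map 'all', None, or an explicit list to a list of algorithm names."""
    if methods is None or methods == "all":
        if prediction_mode == "fast":
            return list(DOC_FAST_ALGORITHMS)
        return list(DOC_CURRENT_ALGORITHMS)
    # Names are case-sensitive: 'knn_scvi' is not 'KNN_SCVI'
    unknown = [name for name in methods if name not in DOC_ALL_ALGORITHMS]
    if unknown:
        raise KeyError(f"Unknown method name(s): {unknown}")
    return list(methods)


print(resolve_methods(["CELLTYPIST", "Support_Vector", "XGboost"]))
print(resolve_methods("all", prediction_mode="fast"))
```

Validating names up front in this way surfaces a misspelled or lowercase method name before any model training starts.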
# Override per-method parameters
ov.popv.annotate_data(
    process_obj.adata,
    methods=['KNN_SCVI', 'SCANVI_POPV'],
    methods_kwargs={
        'KNN_SCVI': {'train_kwargs': {'max_epochs': 50}},
        'SCANVI_POPV': {'train_kwargs': {'max_epochs': 50}},
    },
)
After annotate_data(), these columns appear in adata.obs:
| Column | Description |
|---|---|
| popv_majority_vote_prediction | Majority vote across all methods |
| popv_majority_vote_score | Number of agreeing methods |
| popv_prediction | Ontology-aggregated consensus (if CL enabled) |
| popv_prediction_score | Ontology consensus score |
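Because the score columns count agreeing methods, they can be used to flag cells whose consensus label is weakly supported. The sketch below shows that idea on plain Python lists. The helper name `flag_low_agreement`, the 0.5 cutoff, and the example scores are assumptions for illustration, and `n_methods` is assumed to be the number of methods that were actually run.

```python
def flag_low_agreement(scores, n_methods, min_fraction=0.5):
    """Return True for each cell whose agreement fraction is below the cutoff.

    `scores` holds the number of agreeing methods per cell.
    """
    if n_methods <= 0:
        raise ValueError("n_methods must be positive")
    return [score / n_methods < min_fraction for score in scores]


# Hypothetical scores for three cells annotated with 8 methods
scores = [8, 3, 5]
flags = flag_low_agreement(scores, n_methods=8)
print(flags)
```

In practice the scores would be read from `process_obj.adata.obs['popv_majority_vote_score']`, and flagged cells are candidates for manual review or an "unknown" label.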
# Agreement plots: confusion matrices per method vs consensus
ov.popv.make_agreement_plots(
    process_obj.adata,
    prediction_keys=process_obj.adata.uns['prediction_keys'],
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
    show=True,
)

# Bar plot: agreement score per cell type
ov.popv.agreement_score_bar_plot(
    process_obj.adata,
    popv_prediction_key='popv_prediction',
    save_folder='results/popv/',
)

# Bar plot: prediction score distribution
ov.popv.prediction_score_bar_plot(
    process_obj.adata,
    popv_prediction_score='popv_prediction_score',
    save_folder='results/popv/',
)

# Bar plot: cell type proportions (ref vs query)
ov.popv.celltype_ratio_bar_plot(
    process_obj.adata,
    popv_prediction='popv_prediction',
    save_folder='results/popv/',
)
For large references (e.g., Human Cell Atlas), use pretrained models to skip training:
from omicverse.popv.hub import HubModel

# Pull pretrained model from HuggingFace
model = HubModel.pull_from_huggingface_hub(
    repo_name='popv/immune_all',
    cache_dir='models/popv/',
)

# Annotate query data directly (fast mode)
result_adata = model.annotate_data(
    query_adata=query_adata,
    query_batch_key='batch',
    prediction_mode='fast',
    methods=None,  # uses model's default methods
)
# CORRECT: methods as list of strings matching class names
ov.popv.annotate_data(adata, methods=['KNN_SCVI', 'CELLTYPIST', 'Support_Vector'])
# WRONG: passing class objects or lowercase names
# ov.popv.annotate_data(adata, methods=[KNN_SCVI, CELLTYPIST]) # TypeError
# ov.popv.annotate_data(adata, methods=['knn_scvi']) # KeyError
# CORRECT: ref_labels_key must exist in ref_adata.obs before Process_Query
assert 'cell_type' in ref_adata.obs.columns
process_obj = ov.popv.Process_Query(ref_labels_key='cell_type', ...)
# WRONG: forgetting to set unknown_celltype_label causes NaN in voting
# process_obj = ov.popv.Process_Query(..., unknown_celltype_label=None) # NaN errors
# CORRECT: access consensus results after annotation
final_labels = process_obj.adata.obs['popv_majority_vote_prediction']
# or ontology-refined:
final_labels = process_obj.adata.obs['popv_prediction']
# WRONG: looking for results on the original query_adata
# query_adata.obs['popv_prediction'] # KeyError: results are on process_obj.adata
import omicverse.popv as popv
popv.settings.accelerator = 'gpu' # for scVI/scANVI training
popv.settings.cuml = True # for KNN/SVM/RF via cuML
popv.settings.n_jobs = 10 # parallel jobs for CPU methods
Troubleshooting:
- RuntimeError: CUDA out of memory during scVI/scANVI training: Reduce hvg (try 2000), decrease n_samples_per_label (try 100), or switch to prediction_mode='fast', which uses fewer epochs.
- … methods_kwargs={'CELLTYPIST': {'method_kwargs': {'model': '/path/to/local/model.pkl'}}} to use a local model file.
- … methods list.
- KeyError: 'gene_name' (gene identifier mismatch): Harmonize var_names between reference and query before calling Process_Query. Use adata.var_names = adata.var['gene_symbols'] if ENSEMBL IDs are in var_names.
- ValueError: batch_key contains NaN: Clean batch columns before PopV. Apply the batch validation pattern from the single-preprocessing skill: adata.obs['batch'] = adata.obs['batch'].fillna('unknown').astype('category').
- FileNotFoundError in inference mode: Ensure save_path_trained_models points to the same directory used during the original retrain run. Check that model files (.pt, .pkl, .joblib) exist.

Dependencies:
- omicverse, scanpy, anndata, numpy, pandas
- scvi-tools, torch (for KNN_SCVI, SCANVI_POPV)
- scikit-learn, xgboost (for RF, SVM, XGBoost)
- harmonypy, bbknn, scanorama (for respective KNN methods)
- celltypist, OnClass (optional per method)
- obonet, pronto (for ontology-aware voting)
- huggingface_hub (for pretrained models)

reference.md