Name: Datasets Loading
Author: omicverse

RNA Velocity & Trajectories

Function	Cells	Genes	Description
`ov.datasets.dentate_gyrus()`	18,213	27,998	Dentate gyrus (loom)
`ov.datasets.dentate_gyrus_scvelo()`	2,930	13,913	DG subset from scVelo
`ov.datasets.zebrafish()`	4,181	16,940	Zebrafish development
`ov.datasets.pancreatic_endocrinogenesis()`	—	—	Pancreatic epithelial
`ov.datasets.pancreas_cellrank()`	2,930	13,913	Pancreas cellrank benchmark
`ov.datasets.scnt_seq_neuron_splicing()`	13,476	44,021	scNT-seq neuron splicing
`ov.datasets.scnt_seq_neuron_labeling()`	3,060	24,078	scNT-seq neuron labeling
`ov.datasets.sceu_seq_rpe1()`	~2,930	~13,913	scEU-seq RPE1
`ov.datasets.sceu_seq_organoid()`	3,831	9,157	scEU-seq organoid
`ov.datasets.haber()`	7,216	27,998	Intestinal epithelium
`ov.datasets.chromaffin()`	—	—	Chromaffin cell lineage
`ov.datasets.hg_forebrain_glutamatergic()`	1,720	32,738	Human forebrain
`ov.datasets.toggleswitch()`	200	2	Two-gene simulation

Bulk RNA-seq & Deconvolution

Function	Description
`ov.datasets.burczynski06()`	UC/CD PBMC bulk (127 samples)
`ov.datasets.moignard15()`	Embryo hematopoiesis qRT-PCR
`ov.datasets.decov_bulk_covid_bulk()`	COVID-19 PBMC bulk
`ov.datasets.decov_bulk_covid_single()`	COVID-19 PBMC single-cell ref

import omicverse as ov

# Basic mock dataset
adata = ov.datasets.create_mock_dataset(
    n_cells=2000,
    n_genes=1500,
    n_cell_types=6,
    with_clustering=False,
    random_state=42,
)
# adata.obs: cell_type, sample_id, condition, tissue
# adata.var: gene_symbols, highly_variable

# With full preprocessing (normalization, PCA, UMAP, Leiden clustering)
adata = ov.datasets.create_mock_dataset(
    n_cells=5000,
    n_genes=3000,
    n_cell_types=10,
    with_clustering=True,
)
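
The comments above list the `obs` and `var` columns the mock dataset is expected to carry. A minimal sketch of a sanity check for those columns follows. The helper name `missing_columns` is hypothetical and not part of OmicVerse, and a stand-in object replaces the real AnnData so the snippet runs without any dependencies.

```python
from types import SimpleNamespace

# Column names taken from the comments in the snippet above.
EXPECTED_OBS = ("cell_type", "sample_id", "condition", "tissue")
EXPECTED_VAR = ("gene_symbols", "highly_variable")


def missing_columns(adata, expected_obs=EXPECTED_OBS, expected_var=EXPECTED_VAR):
    """Report which expected obs/var columns are absent (hypothetical helper)."""
    obs_cols = set(adata.obs.columns)
    var_cols = set(adata.var.columns)
    return {
        "obs": [c for c in expected_obs if c not in obs_cols],
        "var": [c for c in expected_var if c not in var_cols],
    }


# Stand-in for an AnnData object; only `.obs.columns` and `.var.columns` are used.
fake_adata = SimpleNamespace(
    obs=SimpleNamespace(columns=["cell_type", "sample_id", "condition"]),
    var=SimpleNamespace(columns=["gene_symbols", "highly_variable"]),
)

report = missing_columns(fake_adata)
print(report)  # {'obs': ['tissue'], 'var': []}
```

With a real dataset, the same call would be `missing_columns(adata)`, since AnnData exposes `obs` and `var` as DataFrames with a `columns` attribute.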

from omicverse.datasets import predefined_signatures, load_signatures_from_file

# Available signature keys
print(list(predefined_signatures.keys()))
# ['cell_cycle_human', 'cell_cycle_mouse', 'gender_human', 'gender_mouse',
#  'mitochondrial_genes_human', 'mitochondrial_genes_mouse',
#  'ribosomal_genes_human', 'ribosomal_genes_mouse',
#  'apoptosis_human', 'apoptosis_mouse',
#  'human_lung', 'mouse_lung', 'mouse_brain', 'mouse_liver', 'emt_human']

# Load a signature → dict[str, list[str]]
cell_cycle = load_signatures_from_file(predefined_signatures['cell_cycle_human'])
# {'S_genes': ['MCM5', 'PCNA', ...], 'G2M_genes': ['HMGB2', 'CDK1', ...]}

# Use with scoring
import scanpy as sc
sc.tl.score_genes_cell_cycle(adata, s_genes=cell_cycle['S_genes'],
                              g2m_genes=cell_cycle['G2M_genes'])
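
Before scoring, it can be useful to check how many signature genes are actually present in the dataset. The sketch below assumes the signature has the `dict[str, list[str]]` shape shown above. The helper `filter_signature` is hypothetical, and the gene list is a toy stand-in for `adata.var_names`.

```python
def filter_signature(signature, var_names):
    """Split each gene list into genes present in var_names and genes missing from it."""
    available = set(var_names)
    kept = {
        key: [g for g in genes if g in available]
        for key, genes in signature.items()
    }
    missing = {
        key: [g for g in genes if g not in available]
        for key, genes in signature.items()
    }
    return kept, missing


# Toy inputs: a truncated signature and a made-up list of dataset genes.
signature = {"S_genes": ["MCM5", "PCNA"], "G2M_genes": ["HMGB2", "CDK1"]}
var_names = ["PCNA", "CDK1", "ACTB"]

kept, missing = filter_signature(signature, var_names)
print(kept)     # {'S_genes': ['PCNA'], 'G2M_genes': ['CDK1']}
print(missing)  # {'S_genes': ['MCM5'], 'G2M_genes': ['HMGB2']}
```

In practice the call would be `filter_signature(cell_cycle, adata.var_names)`, passing `kept['S_genes']` and `kept['G2M_genes']` to the scoring function.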

# CORRECT: use ov.datasets for standard benchmarks
adata = ov.datasets.pbmc3k()

# WRONG: manually downloading what's already built-in
# import urllib.request
# urllib.request.urlretrieve('https://...', 'pbmc3k.h5ad')  # unnecessary!
# adata = ov.read('pbmc3k.h5ad')

# CORRECT: pbmc3k(processed=True) for pre-processed version
adata = ov.datasets.pbmc3k(processed=True)

# WRONG: loading raw then manually preprocessing for a demo
# adata = ov.datasets.pbmc3k()
# sc.pp.normalize_total(adata)  # unnecessary if you just need a quick demo

# CORRECT: mock data for testing (no network needed)
adata = ov.datasets.create_mock_dataset(n_cells=500, n_genes=200)

# WRONG: creating synthetic data manually with numpy
# X = np.random.poisson(1, (500, 200))  # missing metadata, layers, etc.
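
Since the mock dataset needs no network, one possible pattern is to try a built-in dataset first and fall back to mock data when loading fails. This is a sketch, not an OmicVerse feature: `load_with_fallback` is a hypothetical helper, and stub callables stand in for the real loaders so the snippet runs offline.

```python
def load_with_fallback(primary, fallback, errors=(Exception,)):
    """Call primary(); if it raises one of `errors`, call fallback() instead.

    Returns a (data, source) tuple where source is "primary" or "fallback".
    """
    try:
        return primary(), "primary"
    except errors as err:
        print(f"primary loader failed ({err!r}); using fallback")
        return fallback(), "fallback"


# Stub loaders standing in for real dataset functions.
def offline_loader():
    raise OSError("no network")


def mock_loader():
    return "mock dataset"


data, source = load_with_fallback(offline_loader, mock_loader)
print(source)  # fallback
```

With OmicVerse installed, the call might look like `load_with_fallback(ov.datasets.pbmc3k, lambda: ov.datasets.create_mock_dataset(n_cells=500, n_genes=200))`. Catching `Exception` by default is an assumption; narrowing `errors` to the exceptions the loader actually raises is safer.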

Single-Cell

Function	Cells	Genes	Description
`ov.datasets.pbmc3k()`	2,700	32,738	10x PBMC3k (raw or processed)
`ov.datasets.pbmc8k()`	~8,000	—	10x PBMC 8k
`ov.datasets.paul15()`	2,730	3,451	Myeloid progenitors
`ov.datasets.krumsiek11()`	640	11	Myeloid differentiation simulation
`ov.datasets.bone_marrow()`	5,780	27,876	Bone marrow hematopoietic
`ov.datasets.hematopoiesis()`	—	—	Processed hematopoiesis
`ov.datasets.hematopoiesis_raw()`	—	—	Raw hematopoiesis
`ov.datasets.sc_ref_Lymph_Node()`	~10,000	~15,000	Lymph node reference
`ov.datasets.bhattacherjee()`	~5,000	~2,000	Mouse PFC cocaine study
`ov.datasets.human_tfs()`	—	—	Human TF list (DataFrame)

Spatial & Multiome

Function	Description
`ov.datasets.seqfish()`	SeqFISH spatial transcriptomics
`ov.datasets.multi_brain_5k()`	10x E18 mouse brain multiome (MuData)

Synthetic

Function	Description
`ov.datasets.create_mock_dataset()`	Configurable synthetic scRNA-seq
`ov.datasets.blobs()`	Gaussian blob clusters

Datasets Loading

OmicVerse Built-in Datasets

When to Use This Module

Dataset Catalog

Single-Cell

RNA Velocity & Trajectories

Spatial & Multiome

Bulk RNA-seq & Deconvolution

Synthetic

Mock Data Generation

Predefined Gene Set Signatures

Critical API Reference

Caching Behavior

Troubleshooting

Dependencies

Examples

References
