TCGA bulk RNA-seq preprocessing with pyTCGA: GDC sample sheets, expression archives, clinical metadata, Kaplan-Meier survival analysis, and annotated AnnData export.
Use this skill for loading TCGA data from GDC downloads, building normalised expression matrices, attaching clinical metadata, and running survival analyses through ov.bulk.pyTCGA.
Confirm the user has three items from the GDC Data Portal:
- `gdc_sample_sheet.<date>.tsv`: the sample sheet export
- `gdc_download_xxxxx/`: directory with expression archives
- `clinical.cart.<date>/`: directory with clinical XML/JSON files

import omicverse as ov
import scanpy as sc
ov.plot_set()
aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
aml_tcga.adata_init() # Builds AnnData with raw counts, FPKM, and TPM layers
aml_tcga.adata.write_h5ad('data/ov_tcga_raw.h5ad', compression='gzip')
# To reload later:
new_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
new_tcga.adata_read('data/ov_tcga_raw.h5ad')
aml_tcga.adata_meta_init() # Gene ID → symbol mapping, patient info
aml_tcga.survial_init() # NOTE: "survial" spelling — see Critical API Reference below
# Single gene
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)
# All genes (can take minutes for large gene sets)
aml_tcga.survial_analysis_all() # NOTE: "survial" spelling
aml_tcga.adata.write_h5ad('data/ov_tcga_survival.h5ad', compression='gzip')
The pyTCGA API has a spelling inconsistency in its method names, and the names must be typed exactly as defined. Two methods use "survial" (missing the second 'v'), while one uses the correct "survival":
| Method | Spelling | Purpose |
|---|---|---|
| `survial_init()` | survial (second 'v' missing) | Initialize survival metadata columns |
| `survival_analysis(gene, layer, plot)` | survival (correct) | Single-gene Kaplan-Meier curve |
| `survial_analysis_all()` | survial (second 'v' missing) | Sweep all genes for survival significance |
# CORRECT — use the exact method names as documented
aml_tcga.survial_init() # "survial" — no 'v'
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True) # "survival" — correct
aml_tcga.survial_analysis_all() # "survial" — no 'v'
# WRONG — these will raise AttributeError
# aml_tcga.survival_init() # AttributeError! Use survial_init()
# aml_tcga.survival_analysis_all() # AttributeError! Use survial_analysis_all()
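If a script should keep working even if a later release corrects the spelling, one option is to look the method up by name and fall back to the alternative. The sketch below is hypothetical: `call_first_available` and `FakeTCGA` are illustrative names and not part of omicverse, and `FakeTCGA` only stands in for a pyTCGA object so the example runs without any TCGA data.

```python
def call_first_available(obj, *method_names, **kwargs):
    """Call the first method in `method_names` that exists on `obj`."""
    for name in method_names:
        method = getattr(obj, name, None)
        if callable(method):
            return method(**kwargs)
    raise AttributeError(
        f"{type(obj).__name__} has none of the methods: {', '.join(method_names)}"
    )


class FakeTCGA:
    """Stand-in for a pyTCGA object; it only mimics the misspelled method name."""

    def survial_init(self):
        return "survival metadata initialised"


# Try the corrected spelling first, then the misspelled one documented above.
result = call_first_available(FakeTCGA(), "survival_init", "survial_init")
print(result)
```

With a real object, the same call would be `call_first_available(aml_tcga, "survival_init", "survial_init")`. For short scripts, typing the documented names directly is simpler.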
survival_analysis() performs Kaplan-Meier analysis:
- With `plot=True`, renders survival curves with confidence intervals

Layer selection matters: use `layer='deseq_normalize'` (recommended), because DESeq2 normalization accounts for library size and composition bias, making expression comparable across samples. Alternatively, use `layer='tpm'` for TPM-normalized values.
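To make the Kaplan-Meier step concrete, here is a minimal standard-library sketch of the estimator on made-up data. It is not pyTCGA's implementation. The median split into high and low expression groups is an assumption for illustration, since the text above does not say how the groups are formed, and `kaplan_meier` is a hypothetical helper name.

```python
from statistics import median


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    events: 1 if death was observed, 0 if the patient was censored
    """
    curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(durations, events) if e})
    for t in event_times:
        # Patients still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Observed deaths exactly at time t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data (invented for illustration): expression, follow-up days, event flag
expression = [2.0, 8.0, 3.0, 9.0, 1.0, 7.0]
days = [300, 100, 400, 150, 500, 200]
died = [0, 1, 1, 1, 0, 1]

# Assumption: patients are split at the median expression value
cutoff = median(expression)
high = [i for i, x in enumerate(expression) if x > cutoff]
low = [i for i, x in enumerate(expression) if x <= cutoff]

km_high = kaplan_meier([days[i] for i in high], [died[i] for i in high])
km_low = kaplan_meier([days[i] for i in low], [died[i] for i in low])
print("high:", km_high)
print("low:", km_low)
```

Censored patients contribute to the at-risk count until their last follow-up but never trigger a drop in the curve, which is why the low group's curve has a single step in this toy data.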
import os
# Before pyTCGA init: verify all paths exist
for name, path in [('sample_sheet', sample_sheet_path),
                   ('downloads', download_dir),
                   ('clinical', clinical_dir)]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"TCGA {name} path not found: {path}")
# After adata_init(): verify expected layers were created
expected_layers = ['counts', 'fpkm', 'tpm']
for layer in expected_layers:
    if layer not in aml_tcga.adata.layers:
        print(f"WARNING: Missing layer '{layer}' — check if TCGA archives are fully extracted")
# Before survival analysis: verify metadata is initialized
if 'survial_init' not in dir(aml_tcga) or aml_tcga.adata.obs.shape[1] < 5:
    print("WARNING: Run adata_meta_init() and survial_init() before survival analysis")
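The path checks above stop at the first missing path. A variant that collects every problem in one pass can be easier to act on. The sketch below uses only the standard library. `find_input_problems` is a hypothetical helper, and the assumption that clinical XML files sit somewhere below the clinical directory is mine, so adjust it to the actual download layout.

```python
import os
import tempfile


def find_input_problems(sample_sheet_path, download_dir, clinical_dir):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not os.path.isfile(sample_sheet_path):
        problems.append(f"sample sheet is not a file: {sample_sheet_path}")
    for label, path in [("downloads", download_dir), ("clinical", clinical_dir)]:
        if not os.path.isdir(path):
            problems.append(f"{label} directory not found: {path}")
    # Assumption: clinical XML files sit somewhere below clinical_dir
    if os.path.isdir(clinical_dir):
        has_xml = any(
            name.lower().endswith(".xml")
            for _, _, files in os.walk(clinical_dir)
            for name in files
        )
        if not has_xml:
            problems.append(f"no XML files found under: {clinical_dir}")
    return problems


# Demonstration on a throwaway directory tree (all paths here are placeholders)
with tempfile.TemporaryDirectory() as root:
    sheet = os.path.join(root, "gdc_sample_sheet.tsv")
    downloads = os.path.join(root, "gdc_download")
    clinical = os.path.join(root, "clinical.cart")
    open(sheet, "w").close()
    os.makedirs(downloads)
    os.makedirs(clinical)  # left empty on purpose, so the XML check fires
    demo_problems = find_input_problems(sheet, downloads, clinical)

print(demo_problems)
```

In a real run, call `find_input_problems(sample_sheet_path, download_dir, clinical_dir)` before constructing the pyTCGA object and stop if the returned list is non-empty.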
- `AttributeError: 'pyTCGA' object has no attribute 'survival_init'`: Use the misspelled name `survial_init()` (missing the second 'v'). The same applies to `survial_analysis_all()`. See Critical API Reference above.
- `KeyError` during `adata_meta_init()`: Gene IDs in the expression matrix don't match the expected format. TCGA uses ENSG IDs; the method maps them to symbols internally. Ensure archives are from the same GDC download.
- Verify that the `clinical.cart.*` directory contains complete XML files, not just metadata JSONs.
- `survial_analysis_all()` runs very slowly: This tests every gene individually. For a genome with ~20,000 genes, expect 5-15 minutes. Consider filtering to genes of interest first.
- Missing `deseq_normalize` layer: This layer is created during `adata_meta_init()`. If absent, re-run the metadata initialization step.

Related files: `t_tcga.ipynb`, `reference.md`