STRING protein-protein interaction network analysis with pyPPI: query STRING database, build PPI graphs, expand with add_nodes, and visualize styled networks for bulk gene lists.
Use this skill when the user has a gene list and wants to explore protein-protein interactions via the STRING database. The workflow covers species selection, STRING API queries, network construction, and styled visualization through ov.bulk.pyPPI.
import omicverse as ov
ov.style() # or ov.plot_set()
gene_list = ['FAA4', 'POX1', 'FAT1', 'FAS2', 'FAS1', 'FAA1', 'OLE1', 'YJU3', 'TGL3', 'INA1', 'TGL5']
# Remove duplicates and validate
gene_list = list(dict.fromkeys(gene_list)) # preserves order
assert len(gene_list) >= 2, "Need at least 2 genes for PPI analysis"
# Map genes to types and colours for the network figure
gene_type_dict = dict(zip(gene_list, ['Lipid_synthesis'] * 5 + ['Lipid_transport'] * 6))
gene_color_dict = dict(zip(gene_list, ['#F7828A'] * 5 + ['#9CCCA4'] * 6))
Consistent group labels and colours improve legend readability. Every gene in gene_list must appear in both dictionaries.
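Positional `zip` calls silently mislabel genes if the list is reordered or a gene is removed. One way to keep both dictionaries in sync is to derive them from a single group table. The sketch below assumes a hypothetical `groups` layout and a hypothetical `build_gene_metadata` helper; neither is part of omicverse.

```python
# Hypothetical group table: one entry per group, holding its colour and member genes.
groups = {
    "Lipid_synthesis": {"color": "#F7828A",
                        "genes": ["FAA4", "POX1", "FAT1", "FAS2", "FAS1"]},
    "Lipid_transport": {"color": "#9CCCA4",
                        "genes": ["FAA1", "OLE1", "YJU3", "TGL3", "INA1", "TGL5"]},
}


def build_gene_metadata(groups):
    """Return (gene_list, gene_type_dict, gene_color_dict) built from a group table."""
    gene_list, gene_type_dict, gene_color_dict = [], {}, {}
    for group_name, info in groups.items():
        for gene in info["genes"]:
            # A gene listed twice would otherwise be overwritten without warning.
            if gene in gene_type_dict:
                raise ValueError(f"Gene '{gene}' is assigned to more than one group")
            gene_list.append(gene)
            gene_type_dict[gene] = group_name
            gene_color_dict[gene] = info["color"]
    return gene_list, gene_type_dict, gene_color_dict


gene_list, gene_type_dict, gene_color_dict = build_gene_metadata(groups)
```

Because every gene gets its type and colour in the same loop iteration, the requirement that each gene appears in both dictionaries holds by construction.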
species_id = 4932  # NCBI taxonomy ID for S. cerevisiae; the example genes are yeast genes
G_res = ov.bulk.string_interaction(gene_list, species_id)
print(G_res.head())
Inspect the DataFrame for combined_score and evidence channels to verify coverage before building the network.
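A coverage check of this kind might look like the following sketch. The column names (`preferredName_A`, `preferredName_B`, `combined_score`), the score scale, and the `summarize_interactions` helper are assumptions for illustration; check `G_res.columns` and pass the names your omicverse version actually returns. The toy table uses made-up scores, not real STRING output.

```python
import pandas as pd


def summarize_interactions(df, gene_list,
                           source_col="preferredName_A",
                           target_col="preferredName_B",
                           score_col="combined_score",
                           min_score=0.4):
    """Keep edges scoring at least min_score and list query genes left without an edge."""
    strong = df[df[score_col] >= min_score]
    connected = set(strong[source_col]) | set(strong[target_col])
    missing = [g for g in gene_list if g not in connected]
    return strong, missing


# Toy table with invented scores, only to show the shape of the check.
toy = pd.DataFrame({
    "preferredName_A": ["FAS1", "FAS2", "FAA1"],
    "preferredName_B": ["FAS2", "OLE1", "FAA4"],
    "combined_score": [0.99, 0.35, 0.80],
})
strong, missing = summarize_interactions(
    toy, ["FAS1", "FAS2", "FAA1", "FAA4", "OLE1"], min_score=0.4
)
print(f"{len(strong)} edges kept; genes without an edge: {missing}")
```

Genes reported as missing are the ones that would appear as isolated nodes, which is a useful signal before deciding whether to expand the network.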
ppi = ov.bulk.pyPPI(
    gene=gene_list,
    gene_type_dict=gene_type_dict,
    gene_color_dict=gene_color_dict,
    species=species_id,
)
ppi.interaction_analysis()
ppi.plot_network()
STRING requires NCBI taxonomy integer IDs, not species names. The agent must map the user's species to the correct ID.
| Species | Taxonomy ID | Gene Symbol Format |
|---|---|---|
| Human | 9606 | Official HGNC symbols (e.g., TP53, BRCA1) |
| Mouse | 10090 | Official MGI symbols (e.g., Trp53, Brca1) |
| Rat | 10116 | Official RGD symbols |
| Yeast (S. cerevisiae) | 4932 | Systematic names (e.g., YOR317W) or standard names (e.g., FAA4) |
| Zebrafish | 7955 | ZFIN symbols |
| Drosophila | 7227 | FlyBase symbols |
| C. elegans | 6239 | WormBase symbols |
| Arabidopsis | 3702 | TAIR symbols |
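The table above can be turned into a small lookup so that a species name supplied by the user is resolved to an integer ID before any query is sent. This is a minimal sketch; `STRING_TAXONOMY_IDS` and `resolve_species_id` are hypothetical names, and the mapping covers only the species listed in the table.

```python
# Taxonomy IDs copied from the table above; keys are lower-case common names.
STRING_TAXONOMY_IDS = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "yeast": 4932,
    "zebrafish": 7955,
    "drosophila": 7227,
    "c. elegans": 6239,
    "arabidopsis": 3702,
}


def resolve_species_id(species):
    """Return an integer taxonomy ID for a species name, or pass an integer through."""
    if isinstance(species, int):
        return species
    key = str(species).strip().lower()
    if key not in STRING_TAXONOMY_IDS:
        known = ", ".join(sorted(STRING_TAXONOMY_IDS))
        raise ValueError(f"Unknown species '{species}'. Known names: {known}")
    return STRING_TAXONOMY_IDS[key]


species_id = resolve_species_id("Yeast")
```

Raising an error for unknown names, rather than guessing, keeps a mistyped species from silently producing an empty network.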
Small gene lists (fewer than 10 genes) often produce disconnected networks because the query genes may not interact directly. The add_nodes parameter of interaction_analysis() asks STRING to include its top predicted interaction partners.
# For sparse networks: expand by adding STRING's top predicted partners
ppi.interaction_analysis(add_nodes=5) # adds up to 5 STRING-predicted partners
# For focused networks: no expansion (default)
ppi.interaction_analysis() # only edges between input genes
Use add_nodes when the initial network is disconnected or sparse. The added nodes are real proteins from STRING's database, but they may not be biologically relevant to your specific study, so verify them before including them in publications.
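Nodes added by STRING have no entry in gene_type_dict or gene_color_dict, so it helps to list them for review and give them a neutral default style. The sketch below assumes you already have the expanded node names as a plain list; how to read them from the pyPPI object is not covered here. The `extend_metadata` helper, the `"STRING_added"` label, the grey colour, and the `PARTNER_X` node are all hypothetical.

```python
def extend_metadata(network_nodes, gene_type_dict, gene_color_dict,
                    added_type="STRING_added", added_color="#BDBDBD"):
    """Return (added_nodes, type_dict, color_dict) without modifying the inputs."""
    new_types = dict(gene_type_dict)
    new_colors = dict(gene_color_dict)
    added = []
    for node in dict.fromkeys(network_nodes):  # de-duplicate, keep order
        if node not in new_types or node not in new_colors:
            added.append(node)
        # setdefault leaves existing annotations untouched.
        new_types.setdefault(node, added_type)
        new_colors.setdefault(node, added_color)
    return added, new_types, new_colors


# Toy input: two query genes plus one placeholder for a STRING-added partner.
query_types = {"FAS1": "Lipid_synthesis", "FAS2": "Lipid_synthesis"}
query_colors = {"FAS1": "#F7828A", "FAS2": "#F7828A"}
expanded_nodes = ["FAS1", "FAS2", "PARTNER_X"]

added, all_types, all_colors = extend_metadata(expanded_nodes, query_types, query_colors)
print("Review these added nodes before publication:", added)
```

Giving added nodes their own group label also makes them easy to tell apart from the query genes in the figure legend.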
# CORRECT for human — official HGNC symbols
gene_list = ['TP53', 'BRCA1', 'MDM2']
G_res = ov.bulk.string_interaction(gene_list, 9606)
# WRONG — Ensembl IDs won't match STRING's symbol index
# gene_list = ['ENSG00000141510', 'ENSG00000012048'] # No interactions returned!
If genes are in Ensembl format, map them to symbols first (e.g., via ov.bulk.Gene_mapping()).
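A quick pattern check can catch Ensembl-style identifiers before they reach STRING. This is a minimal sketch: the regular expression covers the common Ensembl gene ID layout (`ENS`, an optional species prefix, `G`, 11 digits, optional version suffix), and `find_ensembl_ids` is a hypothetical helper, not an omicverse function.

```python
import re

# "ENS" + optional species letters + "G" + 11 digits + optional ".version"
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def find_ensembl_ids(genes):
    """Return the entries of genes that look like Ensembl gene IDs."""
    return [g for g in genes if ENSEMBL_GENE_RE.match(g)]


mixed = ["TP53", "ENSG00000141510", "BRCA1", "ENSG00000012048"]
flagged = find_ensembl_ids(mixed)
if flagged:
    print(f"Map these to gene symbols before querying STRING: {flagged}")
```

The check only flags identifiers; the actual conversion to symbols is a separate step.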
# Validate gene list
assert gene_list and len(gene_list) >= 2, "Need at least 2 genes for PPI"
gene_list = list(dict.fromkeys(gene_list)) # deduplicate
# Verify all genes appear in metadata dicts
for g in gene_list:
    assert g in gene_type_dict, f"Gene '{g}' missing from gene_type_dict"
    assert g in gene_color_dict, f"Gene '{g}' missing from gene_color_dict"
# Verify species_id is a valid integer
assert isinstance(species_id, int) and species_id > 0, f"species_id must be a positive integer, got {species_id}"
# After query: check if interactions were found
G_res = ov.bulk.string_interaction(gene_list, species_id)
if G_res is None or len(G_res) == 0:
    print("WARNING: No STRING interactions found. Check species_id and gene symbol format.")
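Because STRING rate-limits API requests, the query can occasionally fail with an HTTP 429 error. A simple wait-and-retry wrapper is one way to handle this. The sketch below is generic: `call_with_retry` is a hypothetical helper, and since the exact exception type raised by the query is not specified here, it retries on any `Exception` by default. Narrow `retry_on` once you know the real error class.

```python
import time


def call_with_retry(query, *args, retry_on=(Exception,),
                    wait_seconds=60, max_attempts=3, sleep=time.sleep):
    """Call query(*args); on a retryable error, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query(*args)
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(wait_seconds)


# Intended use (commented out so the sketch runs without network access):
# G_res = call_with_retry(ov.bulk.string_interaction, gene_list, species_id)
```

The `sleep` argument is injectable so the wrapper can be exercised in tests without actually waiting.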
- No interactions returned: check that species_id matches the gene symbol format. Yeast uses systematic names or standard gene names, not human-style symbols. Verify at string-db.org manually.
- HTTPError 429 (rate-limited): STRING limits API requests. Wait 60 seconds between queries, or provide a cached interaction table from a previous run.
- Ensembl IDs in the gene list: convert them to gene symbols first with ov.bulk.Gene_mapping().
- Disconnected or sparse network: set add_nodes=5 (or higher) in interaction_analysis() to expand the network with STRING-predicted partners. Alternatively, lower the combined_score threshold.
- KeyError in gene_color_dict during plotting: every gene in gene_list must have an entry in both gene_type_dict and gene_color_dict. After adding nodes with add_nodes, the expanded gene list may include new genes, so update the dictionaries accordingly.

Related files: t_network.ipynb, reference.md