技能檔案

Comprehensive Target Intelligence Gatherer

Name: Comprehensive Target Intelligence Gatherer
Author: wu-yc

Gather comprehensive biological target intelligence from 9 parallel research paths covering protein info, structure, interactions, pathways, expression, variants, drug interactions, and literature. Features collision-aware searches, evidence grading (T1-T4), explicit Open Targets coverage, and mandatory completeness auditing. Use when users ask about drug targets, proteins, genes, or need target validation, druggability assessment, or comprehensive target profiling.

wu-yc965 星標2026年3月6日

職業
分類: 生物信息學

技能內容

Gather complete target intelligence by exploring 9 parallel research paths. Supports targets identified by gene symbol, UniProt accession, Ensembl ID, or gene name.

KEY PRINCIPLES:

Report-first approach - Create report file FIRST, then populate progressively
Tool parameter verification - Verify params via get_tool_info before calling unfamiliar tools
Evidence grading - Grade all claims by evidence strength (T1-T4)
Citation requirements - Every fact must have inline source attribution
Mandatory completeness - All sections must exist with data minimums or explicit "No data" notes
Disambiguation first - Resolve all identifiers before research
Negative results documented - "No drugs found" is data; empty sections are failures
Collision-aware literature search - Detect and filter naming collisions
English-first queries - Always use English terms in tool calls, even if the user writes in another language. Translate gene names, disease names, and search terms to English. Only try original-language terms as a fallback if English returns no results. Respond in the user's language

相關技能

Comprehensive Target Intelligence Gatherer | Skills Pool

技能檔案

Comprehensive Target Intelligence Gatherer

wu-yc965 星標2026年3月6日

職業
分類: 生物信息學

技能內容

Gather complete target intelligence by exploring 9 parallel research paths. Supports targets identified by gene symbol, UniProt accession, Ensembl ID, or gene name.

KEY PRINCIPLES:

Report-first approach - Create report file FIRST, then populate progressively
Tool parameter verification - Verify params via get_tool_info before calling unfamiliar tools
Evidence grading - Grade all claims by evidence strength (T1-T4)
Citation requirements - Every fact must have inline source attribution
Mandatory completeness - All sections must exist with data minimums or explicit "No data" notes
Disambiguation first - Resolve all identifiers before research
Negative results documented - "No drugs found" is data; empty sections are failures
Collision-aware literature search - Detect and filter naming collisions
English-first queries - Always use English terms in tool calls, even if the user writes in another language. Translate gene names, disease names, and search terms to English. Only try original-language terms as a fallback if English returns no results. Respond in the user's language

相關技能

# Always check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="Reactome_map_uniprot_to_pathways")
# Reveals: takes `id` not `uniprot_id`

Tool	WRONG Parameter	CORRECT Parameter
`Reactome_map_uniprot_to_pathways`	`uniprot_id`	`id`
`ensembl_get_xrefs`	`gene_id`	`id`
`GTEx_get_median_gene_expression`	`gencode_id` only	`gencode_id` + `operation="median"`
`OpenTargets_*`	`ensemblID`	`ensemblId` (camelCase)

# Step 1: Get gene info with version
gene_info = tu.tools.ensembl_lookup_gene(id=ensembl_id, species="human")
version = gene_info.get('version', 1)

# Step 2: Try versioned ID
versioned_id = f"{ensembl_id}.{version}"  # e.g., "ENSG00000123456.12"
result = tu.tools.GTEx_get_median_gene_expression(
    gencode_id=versioned_id,
    operation="median"
)

Tier	Symbol	Criteria	Examples
T1	★★★	Direct mechanistic evidence, human genetic proof	CRISPR KO, patient mutations, crystal structure with mechanism
T2	★★☆	Functional studies, model organism validation	siRNA phenotype, mouse KO, biochemical assay
T3	★☆☆	Association, screen hits, computational	GWAS hit, DepMap essentiality, expression correlation
T4	☆☆☆	Mention, review, text-mined, predicted	Review article, database annotation, computational prediction

---
**Evidence Quality for this Section**: Strong
- Mechanistic (T1): 12 papers
- Functional (T2): 8 papers
- Association (T3): 15 papers
- Mention (T4): 23 papers
**Data Gaps**: No CRISPR data; mouse KO phenotypes limited
---

EGFR mutations cause lung adenocarcinoma [★★★: PMID:15118125, activating mutations 
in patients]. *Source: ClinVar, CIViC*

Target Query (e.g., "EGFR" or "P00533")
│
├─ IDENTIFIER RESOLUTION (always first)
│   └─ Check if GPCR → GPCRdb_get_protein
│
├─ PATH 0: Open Targets Foundation (ALWAYS FIRST - fills gaps in all other paths)
│
├─ PATH 1: Core Identity (names, IDs, sequence, organism)
│   └─ InterProScan_scan_sequence for novel domain prediction (NEW)
├─ PATH 2: Structure & Domains (3D structure, domains, binding sites)
│   └─ If GPCR: GPCRdb_get_structures (active/inactive states)
├─ PATH 3: Function & Pathways (GO terms, pathways, biological role)
├─ PATH 4: Protein Interactions (PPI network, complexes)
├─ PATH 5: Expression Profile (tissue expression, single-cell)
├─ PATH 6: Variants & Disease (mutations, clinical significance)
│   └─ DisGeNET_search_gene for curated gene-disease associations
├─ PATH 7: Drug Interactions (known drugs, druggability, safety)
│   ├─ Pharos_get_target for TDL classification (Tclin/Tchem/Tbio/Tdark)
│   ├─ BindingDB_get_ligands_by_uniprot for known ligands (NEW)
│   ├─ PubChem_search_assays_by_target_gene for HTS data (NEW)
│   ├─ If GPCR: GPCRdb_get_ligands (curated agonists/antagonists)
│   └─ DepMap_get_gene_dependencies for target essentiality
└─ PATH 8: Literature & Research (publications, trends)

def resolve_target_ids(tu, query):
    """
    Resolve target query to ALL needed identifiers.
    Returns dict with: query, uniprot, ensembl, ensembl_version, symbol, 
    entrez, chembl_target, hgnc
    """
    ids = {
        'query': query, 
        'uniprot': None, 
        'ensembl': None, 
        'ensembl_versioned': None,  # For GTEx
        'symbol': None,
        'entrez': None,
        'chembl_target': None,
        'hgnc': None,
        'full_name': None,
        'synonyms': []
    }
    
    # [Resolution logic based on input type]
    # ... (see current implementation)
    
    # CRITICAL: Get versioned Ensembl ID for GTEx
    if ids['ensembl']:
        gene_info = tu.tools.ensembl_lookup_gene(id=ids['ensembl'], species="human")
        if gene_info and gene_info.get('version'):
            ids['ensembl_versioned'] = f"{ids['ensembl']}.{gene_info['version']}"
        
        # Also get synonyms for literature collision detection
        ids['full_name'] = gene_info.get('description', '').split(' [')[0]
    
    # Get UniProt alternative names for synonyms
    if ids['uniprot']:
        alt_names = tu.tools.UniProt_get_alternative_names_by_accession(accession=ids['uniprot'])
        if alt_names:
            ids['synonyms'].extend(alt_names)
    
    return ids

def check_gpcr_target(tu, ids):
    """
    Check if target is a GPCR and retrieve specialized data.
    Call after identifier resolution.
    """
    symbol = ids.get('symbol', '')
    
    # Build GPCRdb entry name
    entry_name = f"{symbol.lower()}_human"
    
    gpcr_info = tu.tools.GPCRdb_get_protein(
        operation="get_protein",
        protein=entry_name
    )
    
    if gpcr_info.get('status') == 'success':
        # Target is a GPCR - get specialized data
        
        # Get structures with receptor state
        structures = tu.tools.GPCRdb_get_structures(
            operation="get_structures",
            protein=entry_name
        )
        
        # Get known ligands (critical for binder projects)
        ligands = tu.tools.GPCRdb_get_ligands(
            operation="get_ligands",
            protein=entry_name
        )
        
        # Get mutation data
        mutations = tu.tools.GPCRdb_get_mutations(
            operation="get_mutations",
            protein=entry_name
        )
        
        return {
            'is_gpcr': True,
            'gpcr_family': gpcr_info['data'].get('family'),
            'gpcr_class': gpcr_info['data'].get('receptor_class'),
            'structures': structures.get('data', {}).get('structures', []),
            'ligands': ligands.get('data', {}).get('ligands', []),
            'mutations': mutations.get('data', {}).get('mutations', []),
            'ballesteros_numbering': True  # GPCRdb provides this
        }
    
    return {'is_gpcr': False}

### 2.x GPCR-Specific Data (GPCRdb)

**Receptor Class**: Class A (Rhodopsin-like)  
**GPCR Family**: Adrenoceptors  

**Structures by State**:
| PDB ID | State | Resolution | Ligand | Year |
|--------|-------|------------|--------|------|
| 3SN6 | Active | 3.2Å | Agonist (BI-167107) | 2011 |
| 2RH1 | Inactive | 2.4Å | Antagonist (carazolol) | 2007 |

**Known Ligands**: 45 agonists, 32 antagonists, 8 allosteric modulators  
**Key Binding Site Residues** (Ballesteros-Weinstein): 3.32, 5.42, 6.48, 7.39

def detect_collisions(tu, symbol, full_name):
    """
    Detect if gene symbol has naming collisions in literature.
    Returns negative filter terms if collisions found.
    """
    # Search by symbol in title
    results = tu.tools.PubMed_search_articles(
        query=f'"{symbol}"[Title]',
        limit=20
    )
    
    # Check if >20% are off-topic
    off_topic_terms = []
    for paper in results.get('articles', []):
        title = paper.get('title', '').lower()
        # Check if title mentions biology/protein/gene context
        bio_terms = ['protein', 'gene', 'cell', 'expression', 'mutation', 'kinase', 'receptor']
        if not any(term in title for term in bio_terms):
            # Extract potential collision terms
            # e.g., "JAK" might collide with "Just Another Kinase" jokes
            # e.g., "WDR7" might collide with other WDR family members in certain contexts
            pass
    
    # Build negative filter
    collision_filter = ""
    if off_topic_terms:
        collision_filter = " NOT " + " NOT ".join(off_topic_terms)
    
    return collision_filter

Endpoint	Section	Data Type
`OpenTargets_get_diseases_phenotypes_by_target_ensemblId`	8	Diseases/phenotypes
`OpenTargets_get_target_tractability_by_ensemblId`	9	Druggability assessment
`OpenTargets_get_target_safety_profile_by_ensemblId`	10	Safety liabilities
`OpenTargets_get_target_interactions_by_ensemblId`	6	PPI network
`OpenTargets_get_target_gene_ontology_by_ensemblId`	5	GO annotations
`OpenTargets_get_publications_by_target_ensemblId`	11	Literature
`OpenTargets_get_biological_mouse_models_by_ensemblId`	8/10	Mouse KO phenotypes
`OpenTargets_get_chemical_probes_by_target_ensemblId`	9	Chemical probes
`OpenTargets_get_associated_drugs_by_target_ensemblId`	9	Known drugs

def path_0_open_targets(tu, ids):
    """
    Open Targets foundation data - fills gaps for sections 5, 6, 8, 9, 10, 11.
    ALWAYS run this first.
    """
    ensembl_id = ids['ensembl']
    if not ensembl_id:
        return {'status': 'skipped', 'reason': 'No Ensembl ID'}
    
    results = {}
    
    # 1. Diseases & Phenotypes (Section 8)
    diseases = tu.tools.OpenTargets_get_diseases_phenotypes_by_target_ensemblId(
        ensemblId=ensembl_id
    )
    results['diseases'] = diseases if diseases else {'note': 'No disease associations returned'}
    
    # 2. Tractability (Section 9)
    tractability = tu.tools.OpenTargets_get_target_tractability_by_ensemblId(
        ensemblId=ensembl_id
    )
    results['tractability'] = tractability if tractability else {'note': 'No tractability data returned'}
    
    # 3. Safety Profile (Section 10)
    safety = tu.tools.OpenTargets_get_target_safety_profile_by_ensemblId(
        ensemblId=ensembl_id
    )
    results['safety'] = safety if safety else {'note': 'No safety liabilities identified'}
    
    # 4. Interactions (Section 6)
    interactions = tu.tools.OpenTargets_get_target_interactions_by_ensemblId(
        ensemblId=ensembl_id
    )
    results['interactions'] = interactions if interactions else {'note': 'No interactions returned'}
    
    # 5. GO Annotations (Section 5)
    go_terms = tu.tools.OpenTargets_get_target_gene_ontology_by_ensemblId(
        ensemblId=ensembl_id
    )
    results['go_terms'] = go_terms if go_terms else {'note': 'No GO annotations returned'}
    
    # 6. Publications (Section 11)
    publications = tu.tools.OpenTargets_get_publications_by_target_ensemblId(
        ensemblId=ensembl_id
    )
    results['publications'] = publications if publications else {'note': 'No publications returned'}
    
    # 7. Mouse Models (Section 8/10)
    mouse_models = tu.tools.OpenTargets_get_biological_mouse_models_by_ensemblId(
        ensemblId=ensembl_id
    )
    results['mouse_models'] = mouse_models if mouse_models else {'note': 'No mouse model data returned'}
    
    # 8. Chemical Probes (Section 9)
    probes = tu.tools.OpenTargets_get_chemical_probes_by_target_ensemblId(
        ensemblId=ensembl_id
    )
    results['chemical_probes'] = probes if probes else {'note': 'No chemical probes available'}
    
    # 9. Associated Drugs (Section 9)
    drugs = tu.tools.OpenTargets_get_associated_drugs_by_target_ensemblId(
        ensemblId=ensembl_id
    )
    results['drugs'] = drugs if drugs else {'note': 'No approved/trial drugs found'}
    
    return results

### 9.3 Chemical Probes

**Status**: No validated chemical probes available for this target.
*Source: OpenTargets_get_chemical_probes_by_target_ensemblId returned empty*

**Implication**: Tool compound development would be needed for chemical biology studies.

def path_structure_robust(tu, ids):
    """
    Robust structure search using 3-step chain.
    """
    structures = {'pdb': [], 'alphafold': None, 'domains': [], 'method_notes': []}
    
    # STEP 1: UniProt PDB Cross-References (most reliable)
    if ids['uniprot']:
        entry = tu.tools.UniProt_get_entry_by_accession(accession=ids['uniprot'])
        pdb_xrefs = [x for x in entry.get('uniProtKBCrossReferences', []) 
                    if x.get('database') == 'PDB']
        for xref in pdb_xrefs:
            pdb_id = xref.get('id')
            # Get details for each PDB
            pdb_info = tu.tools.get_protein_metadata_by_pdb_id(pdb_id=pdb_id)
            if pdb_info:
                structures['pdb'].append(pdb_info)
        structures['method_notes'].append(f"Step 1: {len(pdb_xrefs)} PDB cross-refs from UniProt")
    
    # STEP 2: Sequence-based PDB Search (catches missing annotations)
    if ids['uniprot'] and len(structures['pdb']) < 5:
        sequence = tu.tools.UniProt_get_sequence_by_accession(accession=ids['uniprot'])
        if sequence and len(sequence) < 1000:  # Reasonable length for search
            similar = tu.tools.PDB_search_similar_structures(
                sequence=sequence[:500],  # Use first 500 AA if long
                identity_cutoff=0.7
            )
            if similar:
                for hit in similar[:10]:  # Top 10 similar
                    if hit['pdb_id'] not in [s.get('pdb_id') for s in structures['pdb']]:
                        structures['pdb'].append(hit)
        structures['method_notes'].append(f"Step 2: Sequence search (identity ≥70%)")
    
    # STEP 3: Domain-based Search (for multi-domain proteins)
    if ids['uniprot']:
        domains = tu.tools.InterPro_get_protein_domains(uniprot_accession=ids['uniprot'])
        structures['domains'] = domains if domains else []
        
        # For large proteins with domains, search by domain sequence windows
        if len(structures['pdb']) < 3 and domains:
            for domain in domains[:3]:  # Top 3 domains
                domain_name = domain.get('name', '')
                # Could search PDB by domain name
                domain_hits = tu.tools.PDB_search_by_keyword(query=domain_name, limit=5)
                if domain_hits:
                    structures['method_notes'].append(f"Step 3: Domain '{domain_name}' search")
    
    # AlphaFold (always check)
    alphafold = tu.tools.alphafold_get_prediction(uniprot_accession=ids['uniprot'])
    structures['alphafold'] = alphafold if alphafold else {'note': 'No AlphaFold prediction'}
    
    # IMPORTANT: Document limitations
    if not structures['pdb']:
        structures['limitation'] = "No direct PDB hit does NOT mean no structure exists. Check: (1) structures under different UniProt entries, (2) homolog structures, (3) domain-only structures."
    
    return structures

### 4.1 Experimental Structures (PDB)

**Total PDB Entries**: 23 structures *(Source: UniProt cross-references)*
**Search Method**: 3-step chain (UniProt xrefs → sequence search → domain search)

| PDB ID | Resolution | Method | Ligand | Coverage | Year |
|--------|------------|--------|--------|----------|------|
| 1M17 | 2.6Å | X-ray | Erlotinib | 672-998 | 2002 |
| 3POZ | 2.8Å | X-ray | Gefitinib | 696-1022 | 2010 |

**Note**: "No direct PDB hit" ≠ "no structure exists". Check homologs and domain structures.

def path_expression(tu, ids):
    """
    Expression data with GTEx versioned ID fallback.
    """
    results = {'gtex': None, 'hpa': None, 'failed_tools': []}
    
    # GTEx with fallback
    ensembl_id = ids['ensembl']
    versioned_id = ids.get('ensembl_versioned')
    
    # Try unversioned first
    gtex_result = tu.tools.GTEx_get_median_gene_expression(
        gencode_id=ensembl_id,
        operation="median"
    )
    
    # Fallback to versioned if empty
    if not gtex_result or gtex_result.get('data') == []:
        if versioned_id:
            gtex_result = tu.tools.GTEx_get_median_gene_expression(
                gencode_id=versioned_id,
                operation="median"
            )
            if gtex_result and gtex_result.get('data'):
                results['gtex'] = gtex_result
                results['gtex_note'] = f"Used versioned ID: {versioned_id}"
        
        if not results.get('gtex'):
            results['failed_tools'].append({
                'tool': 'GTEx_get_median_gene_expression',
                'tried': [ensembl_id, versioned_id],
                'fallback': 'See HPA data below'
            })
    else:
        results['gtex'] = gtex_result
    
    # HPA (always query as backup)
    hpa_result = tu.tools.HPA_get_rna_expression_by_source(ensembl_id=ensembl_id)
    results['hpa'] = hpa_result if hpa_result else {'note': 'No HPA RNA data'}
    
    return results

def get_hpa_comprehensive_expression(tu, gene_symbol):
    """
    Get comprehensive expression data from Human Protein Atlas.
    
    Provides:
    - Tissue expression (protein and RNA)
    - Subcellular localization
    - Cell line expression comparison
    - Tissue specificity
    """
    
    # 1. Search for gene to get IDs
    gene_info = tu.tools.HPA_search_genes_by_query(search_query=gene_symbol)
    
    if not gene_info:
        return {'error': f'Gene {gene_symbol} not found in HPA'}
    
    # 2. Get tissue expression with specificity
    tissue_search = tu.tools.HPA_generic_search(
        search_query=gene_symbol,
        columns="g,gs,rnat,rnatsm,scml,scal",  # Gene, synonyms, tissue specificity, subcellular
        format="json"
    )
    
    # 3. Compare expression in cancer cell lines vs normal tissue
    cell_lines = ['a549', 'mcf7', 'hela', 'hepg2', 'pc3']
    cell_line_expression = {}
    
    for cell_line in cell_lines:
        try:
            expr = tu.tools.HPA_get_comparative_expression_by_gene_and_cellline(
                gene_name=gene_symbol,
                cell_line=cell_line
            )
            cell_line_expression[cell_line] = expr
        except:
            continue
    
    return {
        'gene_info': gene_info,
        'tissue_data': tissue_search,
        'cell_line_expression': cell_line_expression,
        'source': 'Human Protein Atlas'
    }

### Tissue Expression Profile (Human Protein Atlas)

| Tissue | Protein Level | RNA nTPM | Specificity |
|--------|---------------|----------|-------------|
| Brain | High | 45.2 | Enriched |
| Liver | Medium | 23.1 | Enhanced |
| Kidney | Low | 8.4 | Not detected |

**Subcellular Localization**: Cytoplasm, Plasma membrane

### Cancer Cell Line Expression

| Cell Line | Cancer Type | Expression | vs Normal |
|-----------|-------------|------------|-----------|
| A549 | Lung | High | Elevated |
| MCF7 | Breast | Medium | Similar |
| HeLa | Cervical | High | Elevated |

*Source: Human Protein Atlas via `HPA_search_genes_by_query`, `HPA_get_comparative_expression_by_gene_and_cellline`*

### 8.3 Clinical Variants (ClinVar)

#### Single Nucleotide Variants (SNVs)
| Variant | Clinical Significance | Condition | Review Status | PMID |
|---------|----------------------|-----------|---------------|------|
| p.L858R | Pathogenic | Lung cancer | 4 stars | 15118125 |
| p.T790M | Pathogenic | Drug resistance | 4 stars | 15737014 |

**Total Pathogenic SNVs**: 47

#### Copy Number Variants (CNVs) - Reported Separately
| Type | Region | Clinical Significance | Frequency |
|------|--------|----------------------|-----------|
| Amplification | 7p11.2 | Pathogenic | Common in cancer |

*Note: CNV data separated as it represents different mutation mechanism*

def get_disgenet_associations(tu, ids):
    """
    Get gene-disease associations from DisGeNET.
    Complements Open Targets with curated association scores.
    """
    symbol = ids.get('symbol')
    if not symbol:
        return {'status': 'skipped', 'reason': 'No gene symbol'}
    
    # Get all disease associations for gene
    gda = tu.tools.DisGeNET_search_gene(
        operation="search_gene",
        gene=symbol,
        limit=50
    )
    
    if gda.get('status') != 'success':
        return {'status': 'error', 'message': 'DisGeNET query failed'}
    
    associations = gda.get('data', {}).get('associations', [])
    
    # Categorize by evidence strength
    strong = []     # score >= 0.7
    moderate = []   # score 0.4-0.7  
    weak = []       # score < 0.4
    
    for assoc in associations:
        score = assoc.get('score', 0)
        disease_name = assoc.get('disease_name', '')
        umls_cui = assoc.get('disease_id', '')
        
        entry = {
            'disease': disease_name,
            'umls_cui': umls_cui,
            'score': score,
            'evidence_index': assoc.get('ei'),
            'dsi': assoc.get('dsi'),  # Disease Specificity Index
            'dpi': assoc.get('dpi')   # Disease Pleiotropy Index
        }
        
        if score >= 0.7:
            strong.append(entry)
        elif score >= 0.4:
            moderate.append(entry)
        else:
            weak.append(entry)
    
    return {
        'total_associations': len(associations),
        'strong_associations': strong,
        'moderate_associations': moderate,
        'weak_associations': weak[:10],  # Limit weak
        'disease_pleiotropy': len(associations)  # How many diseases linked
    }

### 8.x DisGeNET Gene-Disease Associations (NEW)

**Total Diseases Associated**: 47  
**Disease Pleiotropy Index**: High (gene linked to many disease types)

#### Strong Associations (Score ≥0.7)
| Disease | UMLS CUI | Score | Evidence Index |
|---------|----------|-------|----------------|
| Non-small cell lung cancer | C0007131 | 0.85 | 0.92 |
| Glioblastoma | C0017636 | 0.78 | 0.88 |

#### Moderate Associations (Score 0.4-0.7)
| Disease | UMLS CUI | Score | DSI |
|---------|----------|-------|-----|
| Breast cancer | C0006142 | 0.62 | 0.45 |

*Note: DisGeNET score integrates curated databases, GWAS, animal models, and literature*

def get_pharos_target_info(tu, ids):
    """
    Get Pharos/TCRD target development level and druggability.
    
    TDL Classification:
    - Tclin: Approved drug targets
    - Tchem: Targets with small molecule activities (IC50 < 30nM)
    - Tbio: Targets with biological annotations
    - Tdark: Understudied proteins
    """
    gene_symbol = ids.get('symbol')
    uniprot = ids.get('uniprot')
    
    # Try by gene symbol first
    if gene_symbol:
        result = tu.tools.Pharos_get_target(
            gene=gene_symbol
        )
    elif uniprot:
        result = tu.tools.Pharos_get_target(
            uniprot=uniprot
        )
    else:
        return {'status': 'error', 'message': 'Need gene symbol or UniProt'}
    
    if result.get('status') == 'success' and result.get('data'):
        target = result['data']
        return {
            'name': target.get('name'),
            'symbol': target.get('sym'),
            'tdl': target.get('tdl'),  # Tclin/Tchem/Tbio/Tdark
            'family': target.get('fam'),  # Kinase, GPCR, etc.
            'novelty': target.get('novelty'),
            'description': target.get('description'),
            'publications': target.get('publicationCount'),
            'interpretation': interpret_tdl(target.get('tdl'))
        }
    return None

def interpret_tdl(tdl):
    """Interpret Target Development Level for druggability."""
    interpretations = {
        'Tclin': 'Approved drug target - highest confidence for druggability',
        'Tchem': 'Small molecule active - good chemical tractability',
        'Tbio': 'Biologically characterized - may require novel modalities',
        'Tdark': 'Understudied - limited data, high novelty potential'
    }
    return interpretations.get(tdl, 'Unknown')

def search_disease_targets(tu, disease_name):
    """Find targets associated with a disease via Pharos."""
    
    result = tu.tools.Pharos_get_disease_targets(
        disease=disease_name,
        top=50
    )
    
    if result.get('status') == 'success':
        targets = result['data'].get('targets', [])
        # Group by TDL for prioritization
        by_tdl = {'Tclin': [], 'Tchem': [], 'Tbio': [], 'Tdark': []}
        for t in targets:
            tdl = t.get('tdl', 'Unknown')
            if tdl in by_tdl:
                by_tdl[tdl].append(t)
        return by_tdl
    return None

### 9.x Pharos/TCRD Target Classification (NEW)

**Target Development Level**: Tchem  
**Protein Family**: Kinase  
**Novelty Score**: 0.35 (moderately studied)  
**Publication Count**: 12,456

**TDL Interpretation**: Target has validated small molecule activities with IC50 < 30nM. Good chemical starting points exist.

**Disease Targets Analysis** (for disease-centric queries):
| TDL | Count | Examples |
|-----|-------|----------|
| Tclin | 12 | EGFR, ALK, RET |
| Tchem | 45 | KRAS, SHP2, CDK4 |
| Tbio | 78 | Novel kinases |
| Tdark | 23 | Understudied |

*Source: Pharos/TCRD via `Pharos_get_target`*

def assess_target_essentiality(tu, ids):
    """
    Is this target essential for cancer cell survival?
    
    Negative effect scores = gene is essential (cells die upon KO)
    """
    gene_symbol = ids.get('symbol')
    
    if not gene_symbol:
        return {'status': 'error', 'message': 'Need gene symbol'}
    
    deps = tu.tools.DepMap_get_gene_dependencies(
        gene_symbol=gene_symbol
    )
    
    if deps.get('status') == 'success':
        return {
            'gene': gene_symbol,
            'data': deps.get('data', {}),
            'interpretation': 'Negative scores indicate gene is essential for cell survival',
            'note': 'Score < -0.5 is strongly essential, < -1.0 is extremely essential'
        }
    return None

def get_cancer_type_essentiality(tu, gene_symbol, cancer_type):
    """Check if gene is essential in specific cancer type."""
    
    # Get cell lines for cancer type
    cell_lines = tu.tools.DepMap_get_cell_lines(
        cancer_type=cancer_type,
        page_size=20
    )
    
    return {
        'gene': gene_symbol,
        'cancer_type': cancer_type,
        'cell_lines': cell_lines.get('data', {}).get('cell_lines', []),
        'note': 'Query individual cell lines for dependency scores via DepMap portal'
    }

### 9.x Target Essentiality (DepMap) (NEW)

**Gene Essentiality Assessment**:
| Context | Effect Score | Interpretation |
|---------|--------------|----------------|
| Pan-cancer | -0.42 | Moderately essential |
| Lung cancer | -0.78 | Strongly essential |
| Breast cancer | -0.21 | Weakly essential |

**Selectivity**: Differential essentiality suggests cancer-type selective target

**Cell Lines Tested**: 1,054 cancer cell lines from DepMap

*Interpretation*: Score < -0.5 indicates strong dependency. This target is more essential in lung cancer than other cancer types - suggesting lung-selective targeting may be feasible.

*Source: DepMap via `DepMap_get_gene_dependencies`*

def predict_protein_domains(tu, sequence, title="Query protein"):
    """
    Run InterProScan for de novo domain prediction.
    
    Use when:
    - Protein has no InterPro annotations
    - Novel/uncharacterized protein
    - Custom sequence analysis
    """
    
    result = tu.tools.InterProScan_scan_sequence(
        sequence=sequence,
        title=title,
        go_terms=True,
        pathways=True
    )
    
    if result.get('status') == 'success':
        data = result.get('data', {})
        
        # Job may still be running
        if data.get('job_status') == 'RUNNING':
            return {
                'job_id': data.get('job_id'),
                'status': 'running',
                'note': 'Use InterProScan_get_job_results to retrieve when ready'
            }
        
        # Parse completed results
        return {
            'domains': data.get('domains', []),
            'domain_count': data.get('domain_count', 0),
            'go_annotations': data.get('go_annotations', []),
            'pathways': data.get('pathways', []),
            'sequence_length': data.get('sequence_length')
        }
    return None

def check_interproscan_job(tu, job_id):
    """Check status and get results for InterProScan job."""
    
    status = tu.tools.InterProScan_get_job_status(job_id=job_id)
    
    if status.get('data', {}).get('is_finished'):
        results = tu.tools.InterProScan_get_job_results(job_id=job_id)
        return results.get('data', {})
    
    return status.get('data', {})

### Domain Prediction (InterProScan) (NEW)

*Used for uncharacterized protein analysis*

**Predicted Domains**:
| Domain | Database | Start-End | E-value | InterPro Entry |
|--------|----------|-----------|---------|----------------|
| Protein kinase domain | Pfam | 45-305 | 1.2e-89 | IPR000719 |
| SH2 domain | SMART | 320-410 | 3.4e-45 | IPR000980 |

**Predicted GO Terms**:
- GO:0004672 protein kinase activity
- GO:0005524 ATP binding

**Predicted Pathways**:
- Reactome: Signal Transduction

*Source: InterProScan via `InterProScan_scan_sequence`*

def get_bindingdb_ligands(tu, uniprot_id, affinity_cutoff=10000):
    """
    Get ligands with measured binding affinities from BindingDB.
    
    Critical for:
    - Identifying chemical starting points
    - Understanding existing chemical matter
    - Assessing tractability with small molecules
    
    Args:
        uniprot_id: UniProt accession (e.g., P00533 for EGFR)
        affinity_cutoff: Maximum affinity in nM (lower = more potent)
    """
    
    # Get ligands by UniProt
    result = tu.tools.BindingDB_get_ligands_by_uniprot(
        uniprot=uniprot_id,
        affinity_cutoff=affinity_cutoff
    )
    
    if result:
        ligands = []
        for entry in result:
            ligands.append({
                'smiles': entry.get('smile'),
                'affinity_type': entry.get('affinity_type'),  # Ki, IC50, Kd
                'affinity_nM': entry.get('affinity'),
                'monomer_id': entry.get('monomerid'),
                'pmid': entry.get('pmid')
            })
        
        # Sort by affinity (most potent first)
        ligands.sort(key=lambda x: float(x['affinity_nM']) if x['affinity_nM'] else float('inf'))
        
        return {
            'total_ligands': len(ligands),
            'ligands': ligands[:20],  # Top 20 most potent
            'best_affinity': ligands[0]['affinity_nM'] if ligands else None
        }
    
    return {'total_ligands': 0, 'ligands': [], 'note': 'No ligands found in BindingDB'}

def get_ligands_by_structure(tu, pdb_id, affinity_cutoff=10000):
    """Get ligands for a protein by PDB structure ID."""
    
    result = tu.tools.BindingDB_get_ligands_by_pdb(
        pdb_ids=pdb_id,
        affinity_cutoff=affinity_cutoff,
        sequence_identity=100
    )
    
    return result

def find_compound_targets(tu, smiles, similarity_cutoff=0.85):
    """Find other targets for a compound (polypharmacology)."""
    
    result = tu.tools.BindingDB_get_targets_by_compound(
        smiles=smiles,
        similarity_cutoff=similarity_cutoff
    )
    
    return result

### Known Ligands (BindingDB) (NEW)

**Total Ligands with Binding Data**: 156
**Best Reported Affinity**: 0.3 nM (Ki)

#### Most Potent Ligands

| SMILES | Affinity Type | Value (nM) | Source PMID |
|--------|---------------|------------|-------------|
| CC(=O)Nc1ccc(cc1)c2... | Ki | 0.3 | 15737014 |
| CN(C)C/C=C/C(=O)Nc1... | IC50 | 0.8 | 15896103 |
| COc1cc2ncnc(Nc3ccc... | Kd | 2.1 | 16460808 |

**Chemical Tractability Assessment**:
- ✅ **Tchem-level target**: Multiple ligands with <30nM affinity
- ✅ **Diverse chemotypes**: Multiple scaffolds identified
- ✅ **Published literature**: Ligands have PMID references

*Source: BindingDB via `BindingDB_get_ligands_by_uniprot`*

Affinity Range	Interpretation	Drug Development Potential
<1 nM	Ultra-potent	Clinical compound likely
1-10 nM	Highly potent	Drug-like
10-100 nM	Potent	Good starting point
100-1000 nM	Moderate	Needs optimization
>1000 nM	Weak	Early hit only

def get_pubchem_assays_for_target(tu, gene_symbol):
    """
    Get bioassays targeting a gene from PubChem.
    
    Provides:
    - HTS screening results
    - Dose-response data (IC50/EC50)
    - Active compound counts
    """
    
    # Search assays by target gene
    assays = tu.tools.PubChem_search_assays_by_target_gene(
        gene_symbol=gene_symbol
    )
    
    assay_info = []
    if assays.get('data', {}).get('aids'):
        for aid in assays['data']['aids'][:10]:  # Top 10 assays
            # Get assay details
            summary = tu.tools.PubChem_get_assay_summary(aid=aid)
            targets = tu.tools.PubChem_get_assay_targets(aid=aid)
            
            assay_info.append({
                'aid': aid,
                'summary': summary.get('data', {}),
                'targets': targets.get('data', {})
            })
    
    return {
        'total_assays': len(assays.get('data', {}).get('aids', [])),
        'assay_details': assay_info
    }

def get_active_compounds_from_assay(tu, aid):
    """Get active compounds from a specific bioassay."""
    
    actives = tu.tools.PubChem_get_assay_active_compounds(aid=aid)
    
    return {
        'aid': aid,
        'active_cids': actives.get('data', {}).get('cids', []),
        'count': len(actives.get('data', {}).get('cids', []))
    }

### PubChem BioAssay Data (NEW)

**Assays Targeting This Gene**: 45

| AID | Assay Type | Active Compounds | Target Info |
|-----|------------|------------------|-------------|
| 1053104 | Dose-response | 12 | EGFR kinase |
| 504526 | HTS | 234 | EGFR binding |
| 651564 | Confirmatory | 8 | EGFR cellular |

**Total Active Compounds Across Assays**: ~500

*Source: PubChem via `PubChem_search_assays_by_target_gene`, `PubChem_get_assay_active_compounds`*

def path_literature_collision_aware(tu, ids):
    """
    Literature search with collision detection and filtering.
    """
    symbol = ids['symbol']
    full_name = ids.get('full_name', '')
    uniprot = ids['uniprot']
    synonyms = ids.get('synonyms', [])
    
    # Step 1: Detect collisions
    collision_filter = detect_collisions(tu, symbol, full_name)
    
    # Step 2: Build high-precision seed queries
    seed_queries = [
        f'"{symbol}"[Title] AND (protein OR gene OR expression)',  # Symbol in title
        f'"{full_name}"[Title]' if full_name else None,  # Full name in title
        f'"UniProt:{uniprot}"' if uniprot else None,  # UniProt accession
    ]
    seed_queries = [q for q in seed_queries if q]
    
    # Add key synonyms
    for syn in synonyms[:3]:
        seed_queries.append(f'"{syn}"[Title]')
    
    # Step 3: Execute seed queries and collect PMIDs
    seed_pmids = set()
    for query in seed_queries:
        if collision_filter:
            query = f"({query}){collision_filter}"
        results = tu.tools.PubMed_search_articles(query=query, limit=30)
        for article in results.get('articles', []):
            seed_pmids.add(article.get('pmid'))
    
    # Step 4: Expand via citation network (for sparse targets)
    if len(seed_pmids) < 30:
        expanded_pmids = set()
        for pmid in list(seed_pmids)[:10]:  # Top 10 seeds
            # Get related articles
            related = tu.tools.PubMed_get_related(pmid=pmid, limit=20)
            for r in related.get('articles', []):
                expanded_pmids.add(r.get('pmid'))
            
            # Get citing articles
            citing = tu.tools.EuropePMC_get_citations(pmid=pmid, limit=20)
            for c in citing.get('citations', []):
                expanded_pmids.add(c.get('pmid'))
        
        seed_pmids.update(expanded_pmids)
    
    # Step 5: Classify papers by evidence tier
    papers_by_tier = {'T1': [], 'T2': [], 'T3': [], 'T4': []}
    # ... classification logic based on title/abstract keywords
    
    return {
        'total_papers': len(seed_pmids),
        'collision_filter_applied': collision_filter if collision_filter else 'None needed',
        'seed_queries': seed_queries,
        'papers_by_tier': papers_by_tier
    }

def call_with_retry(tu, tool_name, params, max_retries=3):
    """
    Call tool with retry logic.
    """
    for attempt in range(max_retries):
        try:
            result = getattr(tu.tools, tool_name)(**params)
            if result and not result.get('error'):
                return result
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                return {'error': str(e), 'tool': tool_name, 'attempts': max_retries}
    return None

Primary Tool	Fallback 1	Fallback 2	Failure Action
`ChEMBL_get_target_activities`	`GtoPdb_get_target_ligands`	`OpenTargets drugs`	Note in report
`intact_get_interactions`	`STRING_get_protein_interactions`	`OpenTargets interactions`	Note in report
`GO_get_annotations_for_gene`	`OpenTargets GO`	`MyGene GO`	Note in report
`GTEx_get_median_gene_expression`	`HPA_get_rna_expression`	Note as unavailable	Document in report
`gnomad_get_gene_constraints`	`OpenTargets constraint`	-	Note in report
`DGIdb_get_drug_gene_interactions`	`OpenTargets drugs`	`GtoPdb`	Note in report

### 7.1 Tissue Expression

**GTEx Data**: Unavailable (API timeout after 3 attempts)
**Fallback Data (HPA)**:
| Tissue | Expression Level | Specificity |
|--------|-----------------|-------------|
| Liver | High | Enhanced |
| Kidney | Medium | - |

*Note: For complete GTEx data, query directly at gtexportal.org*

Section	Minimum Data	If Not Met
6. PPIs	≥20 interactors	Document which tools failed + why
7. Expression	Top 10 tissues with TPM + HPA RNA summary	Note "limited data" with specific gaps
8. Disease	Top 10 OT diseases + gnomAD constraints + ClinVar summary	Separate SNV/CNV; note if constraint unavailable
9. Druggability	OT tractability + probes + drugs + DGIdb + GtoPdb fallback	"No drugs/probes" is valid data
11. Literature	Total count + 5-year trend + 3-5 key papers with evidence tiers	Note if sparse (<50 papers)

## Completeness Audit (REQUIRED)

### Data Minimums Check
- [ ] PPIs: ≥20 interactors OR explanation why fewer
- [ ] Expression: Top 10 tissues with values OR explicit "unavailable"
- [ ] Diseases: Top 10 associations with scores OR "no associations"
- [ ] Constraints: All 4 scores (pLI, LOEUF, missense Z, pRec) OR "unavailable"
- [ ] Druggability: All modalities assessed; probes + drugs listed OR "none"

### Negative Results Documented
- [ ] Empty tool results noted explicitly (not left blank)
- [ ] Failed tools with fallbacks documented
- [ ] "No data" sections have implications noted

### Evidence Quality
- [ ] T1-T4 grades in Executive Summary disease claims
- [ ] T1-T4 grades in Disease Associations table
- [ ] Key papers table has evidence tiers
- [ ] Per-section evidence summaries included

### Source Attribution
- [ ] Every data point has source tool/database cited
- [ ] Section-end source summaries present

## 15. Data Gaps & Limitations

| Section | Expected Data | Actual | Reason | Alternative Source |
|---------|---------------|--------|--------|-------------------|
| 6. PPIs | ≥20 interactors | 8 | Novel target, limited studies | Literature review needed |
| 7. Expression | GTEx TPM | None | Versioned ID not recognized | See HPA data |
| 9. Probes | Chemical probes | None | No validated probes exist | Consider tool compound dev |

**Recommendations for Data Gaps**:
1. For PPIs: Query BioGRID with broader parameters; check yeast-2-hybrid studies
2. For Expression: Query GEO directly for tissue-specific datasets

# Target Intelligence Report: [TARGET NAME]

**Generated**: [Date] | **Query**: [Original query] | **Status**: In Progress

---

## 1. Executive Summary
[Researching...]
<!-- REQUIRED: 2-3 sentences, disease claims must have T1-T4 grades -->

## 2. Target Identifiers
[Researching...]
<!-- REQUIRED: UniProt, Ensembl (versioned), Entrez, ChEMBL, HGNC, Symbol -->

## 3. Basic Information
### 3.1 Protein Description
[Researching...]
### 3.2 Protein Function
[Researching...]
### 3.3 Subcellular Localization
[Researching...]

## 4. Structural Biology
### 4.1 Experimental Structures (PDB)
[Researching...]
<!-- METHOD: 3-step chain (UniProt xrefs → sequence search → domain search) -->
### 4.2 AlphaFold Prediction
[Researching...]
### 4.3 Domain Architecture
[Researching...]
### 4.4 Key Structural Features
[Researching...]

## 5. Function & Pathways
### 5.1 Gene Ontology Annotations
[Researching...]
<!-- REQUIRED: Evidence codes mapped to T1-T4 -->
### 5.2 Pathway Involvement
[Researching...]

## 6. Protein-Protein Interactions
[Researching...]
<!-- MINIMUM: ≥20 interactors OR explanation -->

## 7. Expression Profile
### 7.1 Tissue Expression (GTEx/HPA)
[Researching...]
<!-- NOTE: Use versioned Ensembl ID for GTEx if needed -->
### 7.2 Tissue Specificity
[Researching...]
<!-- MINIMUM: Top 10 tissues with TPM values -->

## 8. Genetic Variation & Disease
### 8.1 Constraint Scores
[Researching...]
<!-- REQUIRED: pLI, LOEUF, missense Z, pRec with interpretations -->
### 8.2 Disease Associations
[Researching...]
<!-- REQUIRED: Top 10 with OT scores; T1-T4 evidence grades -->
### 8.3 Clinical Variants (ClinVar)
[Researching...]
<!-- REQUIRED: Separate SNV and CNV tables -->
### 8.4 Mouse Model Phenotypes
[Researching...]

## 9. Druggability & Pharmacology
### 9.1 Tractability Assessment
[Researching...]
<!-- REQUIRED: All modalities (SM, Ab, PROTAC, other) -->
### 9.2 Known Drugs
[Researching...]
### 9.3 Chemical Probes
[Researching...]
<!-- NOTE: "No probes" is valid data - document explicitly -->
### 9.4 Clinical Pipeline
[Researching...]
### 9.5 ChEMBL Bioactivity
[Researching...]

## 10. Safety Profile
### 10.1 Safety Liabilities
[Researching...]
### 10.2 Expression-Based Toxicity Risk
[Researching...]
### 10.3 Mouse KO Phenotypes
[Researching...]

## 11. Literature & Research Landscape
### 11.1 Publication Metrics
[Researching...]
<!-- REQUIRED: Total, 5y, 1y, drug-related, clinical -->
### 11.2 Research Trend
[Researching...]
### 11.3 Key Publications
[Researching...]
<!-- REQUIRED: Table with PMID, title, year, evidence tier -->
### 11.4 Evidence Summary by Theme
[Researching...]
<!-- REQUIRED: T1-T4 breakdown per research theme -->

## 12. Competitive Landscape
[Researching...]

## 13. Summary & Recommendations
### 13.1 Target Validation Scorecard
[Researching...]
<!-- REQUIRED: 6 criteria, 1-5 scores, evidence quality noted -->
### 13.2 Strengths
[Researching...]
### 13.3 Challenges & Risks
[Researching...]
### 13.4 Recommendations
[Researching...]
<!-- REQUIRED: ≥3 prioritized (HIGH/MEDIUM/LOW) -->

## 14. Data Sources & Methodology
[Will be populated as research progresses...]

## 15. Data Gaps & Limitations
[To be populated post-audit...]

Tool	Parameter	Notes
`Reactome_map_uniprot_to_pathways`	`id`	NOT `uniprot_id`
`ensembl_get_xrefs`	`id`	NOT `gene_id`
`GTEx_get_median_gene_expression`	`gencode_id`, `operation`	Try versioned ID if empty
`OpenTargets_*`	`ensemblId`	camelCase, not `ensemblID`
`STRING_get_protein_interactions`	`protein_ids`, `species`	List format for IDs
`intact_get_interactions`	`identifier`	UniProt accession

Comprehensive Target Intelligence Gatherer

Comprehensive Target Intelligence Gatherer

Phase 0: Tool Parameter Verification (CRITICAL)

Known Parameter Corrections (Updated)

GTEx Versioned ID Fallback (CRITICAL)

When to Use This Skill

Critical Workflow Requirements

1. Report-First Approach (MANDATORY)

2. Evidence Grading System (MANDATORY)

Evidence Tiers

Required Evidence Grading Locations

Per-Section Evidence Summary

3. Citation Requirements (MANDATORY)

Core Strategy: 9 Research Paths

Identifier Resolution (Phase 1)

GPCR Target Detection (NEW)

Collision Detection for Literature Search

PATH 0: Open Targets Foundation (ALWAYS FIRST)

Path 0 Implementation

Negative Results Are Data

PATH 2: Structure & Domains (Enhanced)

3-Step Structure Search Chain

Structure Section Output Format

PATH 5: Expression Profile (Enhanced)

GTEx with Versioned ID Fallback

Human Protein Atlas - Extended Expression (NEW)

PATH 6: Variants & Disease (Enhanced)

6.1 ClinVar SNV vs CNV Separation

6.2 DisGeNET Integration (NEW)

PATH 7: Druggability & Target Validation (ENHANCED)

7.1 Pharos/TCRD - Target Development Level (NEW)

7.2 DepMap - Target Essentiality Validation (NEW)

7.3 InterProScan - Novel Domain Prediction (NEW)

7.4 BindingDB - Known Ligands & Binding Data (NEW)

7.5 PubChem BioAssay - Screening Data (NEW)

PATH 8: Literature & Research (Collision-Aware)

Collision-Aware Query Strategy

Retry Logic & Fallback Chains

Retry Policy

Fallback Chains (CRITICAL)

Failure Surfacing Rule

Per-Section Data Minimums & Completeness Audit

Minimum Data Requirements (Enforced)

Post-Run Completeness Audit

Data Gap Table (Required if minimums not met)

Report Template (Initial File)

Quick Reference: Tool Parameters

When NOT to Use This Skill

Nanoclaw Repl

Bioinformatics

Smart Explore

Vector Database Engineer

Skin Health Analyzer

Scanpy