Name: Paper Search Workflow
Author: conradry

("KRAS" AND "A549") AND ("CRISPR" OR "knockout" OR "Cas9")
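A boolean query like the one above can be assembled from term groups instead of being typed by hand. The following is a minimal sketch; `build_boolean_query` and the `(operator, terms)` group layout are hypothetical names introduced here for illustration and are not part of the workflow's own scripts.

```python
def build_boolean_query(groups):
    """Join term groups into a boolean query string.

    Each group is an (operator, terms) pair. Terms inside a group are quoted
    and joined with that group's operator; the groups are then joined with AND.
    (Hypothetical helper, not part of the source scripts.)
    """
    clauses = []
    for operator, terms in groups:
        quoted = [f'"{term}"' for term in terms]
        clauses.append("(" + f" {operator} ".join(quoted) + ")")
    return " AND ".join(clauses)


query = build_boolean_query([
    ("AND", ["KRAS", "A549"]),
    ("OR", ["CRISPR", "knockout", "Cas9"]),
])
print(query)
```

Keeping the groups as data makes it easier to reuse the same terms later, for example when recording `query_terms` in the result metadata.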

import sys

sys.path.insert(0, "scripts")  # make the helpers in scripts/ importable
from db_connect import get_db, search_text

db = get_db()

# search_text() calls .to_pandas() and then str.contains(); safe for small tables
hits = search_text(db, "publications", "section_text", "<query_terms>")

# Or use pandas directly for more complex filters.
# Note: str.contains() treats the pattern as a regular expression by default;
# pass regex=False to match the terms literally.
pubs_df = db.open_table("publications").to_pandas()
results = pubs_df[
    pubs_df["section_text"].str.contains("<query_terms>", case=False, na=False)
]
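Because `str.contains()` takes a single pattern, a grouped AND/OR query has to be evaluated group by group. The sketch below shows one way to do that on a plain string; `matches_query` and the `(operator, terms)` layout are assumptions made for illustration, not functions from `db_connect`.

```python
def matches_query(text, groups):
    """Return True if text satisfies every term group.

    groups is a list of (operator, terms) pairs: an "AND" group needs all of
    its terms present, any other operator is treated as "OR" and needs at
    least one. Matching is a case-insensitive literal substring test.
    (Hypothetical helper, not part of the source scripts.)
    """
    if not isinstance(text, str):  # covers None and NaN from missing values
        return False
    haystack = text.lower()
    for operator, terms in groups:
        found = [term.lower() in haystack for term in terms]
        satisfied = all(found) if operator == "AND" else any(found)
        if not satisfied:
            return False
    return True


groups = [
    ("AND", ["KRAS", "A549"]),
    ("OR", ["CRISPR", "knockout", "Cas9"]),
]
print(matches_query("KRAS knockout in A549 cells", groups))
print(matches_query("KRAS signalling in A549 cells", groups))
```

With the publications DataFrame loaded, the same function could be applied row by row, for example `pubs_df[pubs_df["section_text"].apply(lambda t: matches_query(t, groups))]`. Like the pandas filter, this is only reasonable for small tables.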

datasets = db.open_table("datasets")
# By accession (scalar filter — works on S3)
ds = datasets.search().where("accession_id = 'GSE12345'").to_pandas()

# By text on dataset_description (small table — pandas OK)
ds_df = datasets.to_pandas()
hits = ds_df[ds_df["dataset_description"].str.contains("<query_terms>", case=False, na=False)]

gene_expr = db.open_table("gene_expression")
# Use a scalar .where() with LIKE; never call .to_pandas() on this table without a filter
cells = gene_expr.search().where(
    "perturbation_search_string LIKE '%GENE_ID:<gene_index>%'"
).limit(100).to_pandas()
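The filter strings passed to `.where()` above are plain SQL text, so it can help to validate values before interpolating them. The helpers below are a sketch under two assumptions that the source does not confirm: accessions follow the GEO series form shown in the example (`GSE` plus digits), and `<gene_index>` is an integer. Both function names are hypothetical.

```python
import re

# Assumption: accessions look like GEO series IDs (e.g. GSE12345), as in the
# example above; widen the pattern if the table holds other formats.
ACCESSION_PATTERN = re.compile(r"GSE\d+")


def accession_filter(accession_id):
    """Return a WHERE clause for an exact accession match, or raise ValueError."""
    if not ACCESSION_PATTERN.fullmatch(accession_id):
        raise ValueError(f"unexpected accession format: {accession_id!r}")
    return f"accession_id = '{accession_id}'"


def gene_filter(gene_index):
    """Return a LIKE clause for one GENE_ID entry in perturbation_search_string."""
    gene_index = int(gene_index)  # assumption: the index is an integer
    return f"perturbation_search_string LIKE '%GENE_ID:{gene_index}%'"


print(accession_filter("GSE12345"))
print(gene_filter(7))
```

The returned strings can be passed to `.where()` in place of the hand-written filters. Note that a pattern such as `%GENE_ID:12%` also matches `GENE_ID:123`; whether that matters depends on how entries in `perturbation_search_string` are delimited, which is not shown here.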

pubs_df = db.open_table("publications").to_pandas()
ds_df   = db.open_table("datasets").to_pandas()

# Find papers matching query, then get their datasets
paper_pmids = pubs_df[
    pubs_df["title"].str.contains("<query>", case=False, na=False)
]["pmid"].unique()

related_datasets = ds_df[ds_df["pmid"].isin(paper_pmids)]

{
  "query_used": "<structured query object>",
  "sources_searched": ["lancedb"],
  "sources_unavailable": [],
  "total_results": "<number>",
  "candidates": [
    {
      "rank": 1,
      "paper_id": "<DOI>",
      "title": "<paper title>",
      "authors": ["<first author et al.>"],
      "year": 2024,
      "abstract": "<abstract text>",
      "perturbation_type": "<chemical|genetic_crispr|genetic_rnai|combinatorial>",
      "organism": "<species>",
      "cell_types": ["<cell types mentioned>"],
      "data_accessions": ["GSE12345"],
      "data_available": true,
      "citation_count": 45,
      "source": "lancedb",
      "open_access": true
    }
  ],
  "search_metadata": {
    "timestamp": "<ISO 8601>",
    "query_terms": ["<search terms used>"],
    "filters_applied": {}
  }
}
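The result envelope above can be filled in programmatically once candidates have been collected. This is a minimal sketch: `build_response` is a hypothetical helper, and ordering by `citation_count` is a placeholder assumption, since the ranking criteria are not stated here.

```python
import json
from datetime import datetime, timezone


def build_response(query, query_terms, candidates):
    """Wrap candidate dicts in the result envelope shown above.

    Assumption: candidates are ordered by citation_count (descending) purely
    for illustration; substitute the workflow's real ranking criteria.
    """
    ordered = sorted(
        candidates, key=lambda c: c.get("citation_count", 0), reverse=True
    )
    ranked = [
        {**c, "rank": position, "source": "lancedb"}
        for position, c in enumerate(ordered, start=1)
    ]
    return {
        "query_used": query,
        "sources_searched": ["lancedb"],
        "sources_unavailable": [],
        "total_results": len(ranked),
        "candidates": ranked,
        "search_metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
            "query_terms": query_terms,
            "filters_applied": {},
        },
    }


response = build_response(
    query='("KRAS" AND "A549")',
    query_terms=["KRAS", "A549"],
    candidates=[
        {"paper_id": "<DOI-1>", "title": "<paper title 1>", "citation_count": 3},
        {"paper_id": "<DOI-2>", "title": "<paper title 2>", "citation_count": 45},
    ],
)
print(json.dumps(response, indent=2))
```

Building the envelope in one place keeps `total_results` and the `rank` values consistent with the candidate list that is actually returned.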

Paper Search Workflow

Purpose

When to Use

IMPORTANT: LanceDB is the ONLY data source

Workflow Steps

Step 1: Prepare Search Queries

Step 2: Query LanceDB

Step 3: Enrich with Data Availability

Step 4: Rank Candidates

Step 5: Return Candidate List

Error Handling

Dependencies
