Extract structured information from unstructured text using Google's LangExtract library with LLM-powered few-shot extraction, source grounding, and interactive visualization. Use this skill whenever the user wants to: extract entities/information from text documents, clinical notes, legal docs, or reports; parse unstructured text into structured data; do named entity recognition (NER) with custom entity types; extract relationships or attributes from documents; process long documents with chunking and parallel extraction; create interactive visualizations of extracted entities in their source context; use few-shot examples to define custom extraction schemas; work with LangExtract, langextract, or lx.extract(); structure text data from PDFs, articles, transcripts, or any text source.
Use the langextract Python library to extract structured information from unstructured text. LangExtract uses LLMs (Gemini, OpenAI, Ollama) with few-shot examples to identify entities, relationships, and attributes, mapping each extraction to its exact source location.
pip install langextract --break-system-packages
# For OpenAI support:
pip install "langextract[openai]" --break-system-packages
LangExtract requires an LLM API key for cloud models. The user MUST provide one via:
- LANGEXTRACT_API_KEY env var (for Gemini, the default provider)
- OPENAI_API_KEY env var (for OpenAI models)
- api_key= parameter in lx.extract()

If no API key is available, ask the user for one before writing the script. Ollama models (local) need no API key but require a running Ollama server.
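For example, the environment variables can be set in the shell before running the generated script (the key values here are placeholders):

```shell
# Set one of these before running the generated script:
export LANGEXTRACT_API_KEY="your-gemini-key"   # Gemini (default provider)
export OPENAI_API_KEY="your-openai-key"        # OpenAI models
```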
Claude's environment can only reach specific domains. Gemini API (generativelanguage.googleapis.com) and OpenAI API (api.openai.com) may not be reachable. If extraction fails with network errors, inform the user the LLM endpoint is unreachable from this environment and suggest they run the generated script locally.
Every LangExtract task follows three steps:
Write a prompt_description string and one or more ExampleData objects.
The prompt tells the LLM what to extract. The examples show how — they are the schema definition via demonstration. This is the most important part; extraction quality depends almost entirely on prompt + example quality.
import langextract as lx
import textwrap
prompt = textwrap.dedent("""\
Extract [entity types] in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.""")
examples = [
    lx.data.ExampleData(
        text="[representative sample text]",
        extractions=[
            lx.data.Extraction(
                extraction_class="[entity_type]",
                extraction_text="[verbatim text from example]",
                attributes={"key": "value"}
            ),
            # ... more extractions, in order of appearance in text
        ]
    )
]
Critical rules for examples:
- extraction_text must be verbatim from the example text; no paraphrasing.

result = lx.extract(
    text_or_documents=input_text,  # str, list[str], URL, or list[Document]
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",   # Default; see Model Selection below
)
lx.extract() — Full Parameter Reference:
| Parameter | Type | Default | Purpose |
|---|---|---|---|
text_or_documents | str / list / URL | REQUIRED | Input text, list of texts, URL, or list of Document objects |
prompt_description | str | None | Extraction instructions |
examples | list[ExampleData] | None | Few-shot examples defining the schema |
model_id | str | "gemini-2.5-flash" | LLM model identifier |
api_key | str | None | API key (falls back to env var) |
max_char_buffer | int | 1000 | Max characters per chunk. Smaller = better accuracy, more API calls |
extraction_passes | int | 1 | Number of passes. Higher = better recall, more cost. Use 2-3 for long docs |
max_workers | int | 10 | Parallel workers for chunk processing |
batch_length | int | 10 | Chunks per batch |
fence_output | bool | None | Set True for OpenAI models |
use_schema_constraints | bool | True | Set False for OpenAI and Ollama |
model_url | str | None | Ollama endpoint (e.g., "http://localhost:11434") |
language_model_params | dict | None | Provider-specific params (e.g., Vertex AI config) |
temperature | float | None | LLM temperature |
additional_context | str | None | Extra context prepended to the prompt |
debug | bool | False | Enable debug logging |
fetch_urls | bool | True | Auto-fetch URL content |
show_progress | bool | True | Show progress bar |
tokenizer | Tokenizer | None | Custom tokenizer (use for CJK languages) |
# Access extractions programmatically
for ext in result.extractions:
    print(f"{ext.extraction_class}: '{ext.extraction_text}' — {ext.attributes}")

# Save to JSONL
lx.io.save_annotated_documents(
    [result],
    output_name="results.jsonl",
    output_dir="."
)

# Generate interactive HTML visualization
html_content = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # Jupyter environments return an object with .data
    else:
        f.write(html_content)
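The saved JSONL can also be consumed outside LangExtract. A minimal reader, assuming each line serializes the AnnotatedDocument fields listed in this document (extractions, text, document_id) — verify the exact field names against a file your own run produced, and note the file name and record contents here are synthetic:

```python
import json

# Synthetic record mirroring the documented AnnotatedDocument fields.
record = {
    "document_id": "doc_1",
    "text": "Patient was given 250 mg amoxicillin.",
    "extractions": [
        {"extraction_class": "medication", "extraction_text": "amoxicillin",
         "attributes": {"dosage": "250 mg"}},
    ],
}
with open("demo_results.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

# Read back and filter extractions by class.
meds = []
with open("demo_results.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        meds += [e["extraction_text"] for e in doc["extractions"]
                 if e["extraction_class"] == "medication"]
print(meds)  # ['amoxicillin']
```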
The result object is an AnnotatedDocument with fields:
- extractions: list of Extraction objects (each has extraction_class, extraction_text, attributes, char_interval)
- text: the original input text
- document_id: auto-generated ID

| model_id | Provider | Extra Params Needed | Notes |
|---|---|---|---|
gemini-2.5-flash | Gemini | None (default) | Best balance of speed/cost/quality |
gemini-2.5-pro | Gemini | None | Better for complex reasoning tasks |
gpt-4o | OpenAI | fence_output=True, use_schema_constraints=False | Requires langextract[openai] |
gpt-4o-mini | OpenAI | fence_output=True, use_schema_constraints=False | Cheaper OpenAI option |
gemma2:2b | Ollama | model_url="http://localhost:11434", use_schema_constraints=False | Local, no API key |
For documents over ~5,000 characters, adjust these parameters:
result = lx.extract(
    text_or_documents=long_text_or_url,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Multiple passes improve recall
    max_workers=20,         # More parallelism for speed
    max_char_buffer=1000,   # Smaller chunks = better accuracy per chunk
)
Tuning guidelines:
- max_char_buffer: 800-1000 for high accuracy, 2000-5000 for speed. Smaller means more chunks, but each chunk gets more focused LLM attention.
- extraction_passes: 1 for quick/simple tasks, 2-3 for thorough extraction of long documents. Each pass costs proportionally more.
- max_workers: 10-20 is typical. More workers is faster but raises API concurrency; watch rate limits.

For cost optimization on large-scale tasks:
result = lx.extract(
    ...,
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "location": "global",
        "batch": {"enabled": True}
    }
)
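The cost knobs interact multiplicatively, so a back-of-envelope helper makes the trade-off concrete. This is a budgeting sketch only, not part of LangExtract; the real chunker splits on token and sentence boundaries, so actual call counts will differ slightly:

```python
import math

def estimate_api_calls(doc_chars: int, max_char_buffer: int = 1000,
                       extraction_passes: int = 1) -> int:
    """Rough upper bound: one model call per chunk per pass."""
    chunks = math.ceil(doc_chars / max_char_buffer)
    return chunks * extraction_passes

# A 50,000-character document with 1000-char chunks and 3 passes:
print(estimate_api_calls(50_000, max_char_buffer=1000, extraction_passes=3))  # 150
```

Doubling max_char_buffer halves the estimate; each extra pass multiplies it, which is why extraction_passes=2-3 is reserved for long documents where recall matters.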
Read references/prompt-patterns.md for detailed guidance on designing prompts and examples for different domains (medical, legal, financial, literary analysis, NER).
Common issues and fixes:
- Authentication errors: set LANGEXTRACT_API_KEY or pass api_key=.
- Example validation errors: extraction_text doesn't appear verbatim in the example text. Fix the example.
- Rate-limit errors: lower max_workers or add retry logic.
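If rate limits persist, a thin retry wrapper with exponential backoff can be layered over the lx.extract() call. This is a generic sketch; the exception type to catch depends on your provider's SDK, and the flaky stub below stands in for the real model call:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0,
                 retry_on: type = Exception):
    """Call fn(); on failure, sleep base_delay * 2**i and retry.

    Pass retry_on=<your provider's rate-limit exception> and wrap the
    lx.extract() call in a lambda or functools.partial.
    """
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Usage sketch: a stub that fails once, then succeeds on the retry.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("429 rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```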