Extract structured information from unstructured text using Google's LangExtract library with LLM-powered few-shot extraction, source grounding, and interactive visualization. Use this skill whenever the user wants to: extract entities/information from text documents, clinical notes, legal docs, or reports; parse unstructured text into structured data; do named entity recognition (NER) with custom entity types; extract relationships or attributes from documents; process long documents with chunking and parallel extraction; create interactive visualizations of extracted entities in their source context; use few-shot examples to define custom extraction schemas; work with LangExtract, langextract, or lx.extract(); structure text data from PDFs, articles, transcripts, or any text source.
Use the langextract Python library to extract structured information from unstructured text. LangExtract uses LLMs (Gemini, OpenAI, Ollama) with few-shot examples to identify entities, relationships, and attributes, mapping each extraction to its exact source location.
pip install langextract --break-system-packages
# For OpenAI support:
pip install "langextract[openai]" --break-system-packages
LangExtract requires an LLM API key for cloud models. The user MUST provide one via:
- LANGEXTRACT_API_KEY env var (for Gemini, the default provider)
- OPENAI_API_KEY env var (for OpenAI models)
- api_key= parameter in lx.extract()

If no API key is available, ask the user for one before writing the script. Ollama models (local) need no API key but require a running Ollama server.
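For example, the environment variables can be set in the shell before running the generated script (the key values here are placeholders):

```shell
# Set one of these before running the generated script:
export LANGEXTRACT_API_KEY="your-gemini-key"   # Gemini (default provider)
export OPENAI_API_KEY="your-openai-key"        # OpenAI models
```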
Claude's environment can only reach specific domains. Gemini API (generativelanguage.googleapis.com) and OpenAI API (api.openai.com) may not be reachable. If extraction fails with network errors, inform the user the LLM endpoint is unreachable from this environment and suggest they run the generated script locally.
Every LangExtract task follows three steps:
Write a prompt_description string and one or more ExampleData objects.
The prompt tells the LLM what to extract. The examples show how — they are the schema definition via demonstration. This is the most important part; extraction quality depends almost entirely on prompt + example quality.
import langextract as lx
import textwrap
prompt = textwrap.dedent("""\
Extract [entity types] in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.""")
examples = [
    lx.data.ExampleData(
        text="[representative sample text]",
        extractions=[
            lx.data.Extraction(
                extraction_class="[entity_type]",
                extraction_text="[verbatim text from example]",
                attributes={"key": "value"}
            ),
            # ... more extractions, in order of appearance in text
        ]
    )
]
Critical rules for examples:
- extraction_text must be verbatim from the example text; no paraphrasing.

result = lx.extract(
    text_or_documents=input_text,  # str, list[str], URL, or list[Document]
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",   # Default; see Model Selection below
)
lx.extract() — Full Parameter Reference:
| Parameter | Type | Default | Purpose |
|---|---|---|---|
text_or_documents | str / list / URL | REQUIRED | Input text, list of texts, URL, or list of Document objects |
prompt_description | str | None | Extraction instructions |
examples | list[ExampleData] | None | Few-shot examples defining the schema |
model_id | str | "gemini-2.5-flash" | LLM model identifier |
api_key | str | None | API key (falls back to env var) |
max_char_buffer | int | 1000 | Max characters per chunk. Smaller = better accuracy, more API calls |
extraction_passes | int | 1 | Number of passes. Higher = better recall, more cost. Use 2-3 for long docs |
max_workers | int | 10 | Parallel workers for chunk processing |
batch_length | int | 10 | Chunks per batch |
fence_output | bool | None | Set True for OpenAI models |
use_schema_constraints | bool | True | Set False for OpenAI and Ollama |
model_url | str | None | Ollama endpoint (e.g., "http://localhost:11434") |
language_model_params | dict | None | Provider-specific params (e.g., Vertex AI config) |
temperature | float | None | LLM temperature |
additional_context | str | None | Extra context prepended to the prompt |
debug | bool | False | Enable debug logging |
fetch_urls | bool | True | Auto-fetch URL content |
show_progress | bool | True | Show progress bar |
tokenizer | Tokenizer | None | Custom tokenizer (use for CJK languages) |
# Access extractions programmatically
for ext in result.extractions:
    print(f"{ext.extraction_class}: '{ext.extraction_text}' — {ext.attributes}")

# Save to JSONL
lx.io.save_annotated_documents(
    [result],
    output_name="results.jsonl",
    output_dir="."
)

# Generate interactive HTML visualization
html_content = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # Jupyter environments return an object with .data
    else:
        f.write(html_content)
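The saved JSONL can also be consumed outside LangExtract. A minimal reader, assuming each line serializes the AnnotatedDocument fields listed in this document (extractions, text, document_id) — verify the exact field names against a file your own run produced, and note the file name and record contents here are synthetic:

```python
import json

# Synthetic record mirroring the documented AnnotatedDocument fields.
record = {
    "document_id": "doc_1",
    "text": "Patient was given 250 mg amoxicillin.",
    "extractions": [
        {"extraction_class": "medication", "extraction_text": "amoxicillin",
         "attributes": {"dosage": "250 mg"}},
    ],
}
with open("demo_results.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

# Read back and filter extractions by class.
meds = []
with open("demo_results.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        meds += [e["extraction_text"] for e in doc["extractions"]
                 if e["extraction_class"] == "medication"]
print(meds)  # ['amoxicillin']
```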
The result object is an AnnotatedDocument with fields:
- extractions: list of Extraction objects (each has extraction_class, extraction_text, attributes, char_interval)
- text: the original input text
- document_id: auto-generated ID

| model_id | Provider | Extra Params Needed | Notes |
|---|---|---|---|
gemini-2.5-flash | Gemini | None (default) | Best balance of speed/cost/quality |
gemini-2.5-pro | Gemini | None | Better for complex reasoning tasks |
gpt-4o | OpenAI | fence_output=True, use_schema_constraints=False | Requires langextract[openai] |
gpt-4o-mini | OpenAI | fence_output=True, use_schema_constraints=False | Cheaper OpenAI option |
gemma2:2b | Ollama | model_url="http://localhost:11434", use_schema_constraints=False | Local, no API key |
For documents over ~5,000 characters, adjust these parameters:
result = lx.extract(
    text_or_documents=long_text_or_url,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Multiple passes improve recall
    max_workers=20,         # More parallelism for speed
    max_char_buffer=1000,   # Smaller chunks = better accuracy per chunk
)
Tuning guidelines:
- max_char_buffer: 800-1000 for high accuracy, 2000-5000 for speed. Smaller means more chunks, but each chunk gets more focused LLM attention.
- extraction_passes: 1 for quick/simple tasks, 2-3 for thorough extraction of long documents. Each pass costs proportionally more.
- max_workers: 10-20 is typical. More workers is faster but raises API concurrency; watch rate limits.

For cost optimization on large-scale tasks:
result = lx.extract(
    ...,
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "location": "global",
        "batch": {"enabled": True}
    }
)
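The cost knobs interact multiplicatively, so a back-of-envelope helper makes the trade-off concrete. This is a budgeting sketch only, not part of LangExtract; the real chunker splits on token and sentence boundaries, so actual call counts will differ slightly:

```python
import math

def estimate_api_calls(doc_chars: int, max_char_buffer: int = 1000,
                       extraction_passes: int = 1) -> int:
    """Rough upper bound: one model call per chunk per pass."""
    chunks = math.ceil(doc_chars / max_char_buffer)
    return chunks * extraction_passes

# A 50,000-character document with 1000-char chunks and 3 passes:
print(estimate_api_calls(50_000, max_char_buffer=1000, extraction_passes=3))  # 150
```

Doubling max_char_buffer halves the estimate; each extra pass multiplies it, which is why extraction_passes=2-3 is reserved for long documents where recall matters.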
Read references/prompt-patterns.md for detailed guidance on designing prompts and examples for different domains (medical, legal, financial, literary analysis, NER).
Common issues and fixes:
- Authentication errors: set LANGEXTRACT_API_KEY or pass api_key=.
- Example validation errors: extraction_text doesn't appear verbatim in the example text. Fix the example.
- Rate-limit errors: lower max_workers or add retry logic.
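If rate limits persist, a thin retry wrapper with exponential backoff can be layered over the lx.extract() call. This is a generic sketch; the exception type to catch depends on your provider's SDK, and the flaky stub below stands in for the real model call:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0,
                 retry_on: type = Exception):
    """Call fn(); on failure, sleep base_delay * 2**i and retry.

    Pass retry_on=<your provider's rate-limit exception> and wrap the
    lx.extract() call in a lambda or functools.partial.
    """
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Usage sketch: a stub that fails once, then succeeds on the retry.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("429 rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```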