How to use LangExtract to extract structured information from text. Use when writing code that calls lx.extract(), building extraction pipelines, defining examples, or troubleshooting alignment issues.
LangExtract extracts structured information from unstructured text using LLMs.
The main entry point is lx.extract(). Every extraction needs at least one
example and input text. prompt_description is optional but recommended.
pip install langextract
# For OpenAI model support:
pip install langextract[openai]
# Gemini (checks GEMINI_API_KEY then LANGEXTRACT_API_KEY)
export GEMINI_API_KEY="your_key"
# OpenAI (checks OPENAI_API_KEY then LANGEXTRACT_API_KEY)
export OPENAI_API_KEY="your_key"
# Ollama: no API key needed, just a running Ollama server
Auto-routing + env-default resolution is tuned for GPT-style model IDs
(gpt-4*, gpt-5*). For OpenAI-compatible endpoints or non-GPT IDs,
pass an explicit ModelConfig — see .
references/providers.mdimport langextract as lx
examples = [
lx.data.ExampleData(
text="Patient takes lisinopril 10mg daily for hypertension.",
extractions=[
lx.data.Extraction(
extraction_class="medication",
extraction_text="lisinopril",
attributes={"dose": "10mg", "frequency": "daily"},
),
lx.data.Extraction(
extraction_class="condition",
extraction_text="hypertension",
attributes={"status": "active"},
),
],
)
]
result = lx.extract(
text_or_documents="Patient is prescribed metformin 500mg twice daily.",
prompt_description="Extract medications and conditions with attributes.",
examples=examples,
model_id="gemini-2.5-flash",
)
for e in result.extractions:
print(e.extraction_class, e.extraction_text)
print(f" char_interval: {e.char_interval}")
print(f" attributes: {e.attributes}")
See examples/basic_extraction.py for a runnable version and
examples/relationship_extraction.py for using extraction_class +
attributes to encode relationships between entities.
Examples drive model behavior. Follow these rules:
extraction_text must be verbatim from the example text, not paraphrasedextraction_class should be consistent across examplesLangExtract can raise prompt-alignment warnings when an example's
extraction_text values don't align cleanly to the example's text
(failed alignment, or fuzzy/lesser rather than exact). The validator does
not enforce rules 2–4 above — those are guidance to steer the model's
output, not checked invariants.
To fail fast on alignment issues during development, pass these kwargs
directly to lx.extract() (both default to permissive):
from langextract.prompt_validation import PromptValidationLevel
result = lx.extract(
...,
prompt_validation_level=PromptValidationLevel.ERROR, # default: WARNING
prompt_validation_strict=True, # default: False
)
See references/prompt-validation.md for level semantics and strict-mode
behavior.
result = lx.extract(
text_or_documents=text, # str, URL, or list of Documents
prompt_description=prompt, # what to extract (optional)
examples=examples, # few-shot examples (required)
model_id="gemini-2.5-flash", # model to use
extraction_passes=1, # >1 for higher recall (costs more)
max_char_buffer=1000, # chunk size
batch_length=10, # chunks per batch
max_workers=10, # parallel workers (provider-dependent)
context_window_chars=None, # cross-chunk context for coreference
)
text_or_documents accepts a URL string (fetch_urls=True by default).batch_length >= max_workers. Parallel
processing via max_workers is provider-dependent; some providers
parallelize batched prompts, while others (such as the current Ollama
provider) process them sequentially.context_window_chars includes characters from the previous chunk as
context for the current one, which helps with coreference and entity
continuity across chunk boundaries.lx.data.Document — see
examples/multiple_documents.py.# Filter to grounded extractions only (have source positions)
grounded = [e for e in result.extractions if e.char_interval]
# Access source position
for e in grounded:
start = e.char_interval.start_pos
end = e.char_interval.end_pos
matched_text = result.text[start:end]
# Save to JSONL
lx.io.save_annotated_documents(
[result], output_name="results.jsonl", output_dir="."
)
# Visualize from JSONL (shows first document)
html = lx.visualize("results.jsonl")
with open("visualization.html", "w") as f:
if hasattr(html, "data"):
f.write(html.data) # Jupyter/Colab
else:
f.write(html)
# Or visualize an AnnotatedDocument directly
html = lx.visualize(result)
The default is Gemini. For other providers, see references/providers.md,
which covers:
model_url)ModelConfig for advanced provider_kwargs (custom base_url, etc.)router.register()Extractions with char_interval=None: the extraction could not be
located in the source text. Common causes include paraphrased output,
alignment misses, or hallucinated entities. Filter with
[e for e in result.extractions if e.char_interval]. To tune alignment,
see references/resolver-params.md.
Prompt alignment warnings: your examples have extraction_text that
doesn't match the example text verbatim. Fix the examples, or see
references/prompt-validation.md to fail fast during development.
Slow on long documents: extraction_passes > 1 multiplies processing
time. Start with 1 and increase only if recall is insufficient.
Model selection: gemini-2.5-flash is recommended for most tasks;
gemini-2.5-pro for complex reasoning.
references/providers.md — OpenAI, Ollama, ModelConfig, custom pluginsreferences/resolver-params.md — fuzzy alignment tuningreferences/prompt-validation.md — catching example issues earlyexamples/ — runnable scripts for each scenario above