Extract structured entities from unstructured text using few-shot prompting with LLMs. Use when you need to extract people, places, events, relationships, emotions, or any structured information from text documents. Supports custom entity types, attributes, and relationships. Works with any text: articles, documents, conversations, social media, or domain-specific content.
Extract structured, grounded entities from unstructured text using few-shot prompting.
LangExtract's approach: Show, don't tell. Instead of complex instructions, provide 1-2 examples of the exact extraction format you want. The LLM learns the pattern and applies it to new text.
User request: "Extract people and events from this meeting transcript"
Your workflow:
Examples
Q: Today I met with Sarah Johnson at the downtown office. We discussed the Q4 launch plan.
A: {
  "extractions": [
    {
      "person": "Sarah Johnson",
      "person_attributes": {}
    },
    {
      "location": "downtown office",
      "location_attributes": {}
    },
    {
      "event": "Q4 launch plan discussion",
      "event_attributes": {}
    }
  ]
}
Q: [USER'S TEXT HERE]
A:
Send to LLM and parse JSON response
Align extractions to source text (find character positions)
That's it. No complex prompts needed.
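The workflow above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: `build_prompt` and `parse_response` are hypothetical helper names, and the call to the LLM itself is left out because it depends on the client you use. The fence handling assumes some models wrap their JSON reply in a Markdown code fence.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker some models wrap JSON in


def build_prompt(example_text, example_extractions, target_text, instruction=""):
    """Assemble a few-shot prompt in the Examples / Q / A layout."""
    example_json = json.dumps(
        {"extractions": example_extractions}, ensure_ascii=False, indent=2
    )
    parts = [instruction] if instruction else []
    parts += [
        "Examples",
        f"Q: {example_text}",
        f"A: {example_json}",
        f"Q: {target_text}",
        "A:",
    ]
    return "\n".join(parts)


def parse_response(raw):
    """Return the list under "extractions", tolerating a fenced reply."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text).get("extractions", [])
```

Pass the string from `build_prompt` to whichever model client you use, then hand the raw reply to `parse_response`.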
{
  "extractions": [
    {
      "category": "extracted text",
      "category_attributes": {
        "key": "value"
      }
    }
  ]
}
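A parsed response can be checked against this shape before it is used further. The sketch below assumes the convention shown above (one class key per item plus an optional `<class>_attributes` object); `validate` is a hypothetical helper, not part of any library.

```python
def validate(data):
    """Return a list of problems found in a parsed response (empty if none)."""
    items = data.get("extractions") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return ["top-level 'extractions' must be a list"]
    errors = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"item {i}: expected an object")
            continue
        # Every key that is not '<class>_attributes' is treated as a class key.
        classes = [k for k in item if not k.endswith("_attributes")]
        if len(classes) != 1:
            errors.append(f"item {i}: expected exactly one class key, found {classes}")
            continue
        cls = classes[0]
        if not isinstance(item[cls], str):
            errors.append(f"item {i}: value of '{cls}' must be a string")
        if not isinstance(item.get(f"{cls}_attributes", {}), dict):
            errors.append(f"item {i}: '{cls}_attributes' must be an object")
    return errors
```

An empty list means the response matches the format; anything else is worth logging before retrying the extraction.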
Common entity types (customize as needed):
Attributes provide structured metadata:
{
  "person": "张三",
  "person_attributes": {
    "role": "engineer",
    "department": "R&D department"
  }
}
Model relationships between entities:
{
  "relationship": "张三 is 李四's manager",
  "relationship_attributes": {
    "type": "management",
    "source": "张三",
    "target": "李四",
    "direction": "manages"
  }
}
Ask clarifying questions:
Build ONE high-quality example showing:
Example text should be similar to target text (same domain, style, length).
[Optional: Brief instruction]
Examples
Q: [Example text]
A: [Example JSON with extractions]
Q: [Target text - can be chunked if long]
A:
Key principles:
extractions array
Format options:
For text > 3000 chars, split into chunks:
Context injection:
[Previous text]: ...end of previous chunk...
Examples
Q: [Example]
A: [Example JSON]
Q: [Current chunk]
A:
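One way to implement the chunking and context injection is sketched below. The 3000-character limit comes from the text above; the sentence terminators, the 200-character context tail, and the helper names `split_into_chunks` and `chunk_prompt` are assumptions made for illustration. Each chunk is returned with its offset so positions found inside a chunk can be mapped back to the full document.

```python
def split_into_chunks(text, max_chars=3000, terminators=".!?。！？\n"):
    """Split text into (offset, chunk) pairs of at most max_chars characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last sentence terminator in the window.
            cut = max(text.rfind(t, start, end) for t in terminators)
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def chunk_prompt(example_block, chunk, previous_chunk="", context_chars=200):
    """Build the prompt for one chunk, prefixed with the tail of the previous one."""
    parts = []
    if previous_chunk:
        parts.append(f"[Previous text]: ...{previous_chunk[-context_chars:]}")
    parts += ["Examples", example_block, f"Q: {chunk}", "A:"]
    return "\n".join(parts)
```

Here `example_block` is the already formatted `Q: ... / A: ...` example. Cutting at sentence ends keeps most entities inside a single chunk; when a window contains no terminator, the sketch falls back to a hard cut at `max_chars`.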
For higher recall, run extraction multiple times:
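The results of several passes then need to be combined. The sketch below takes the union and treats two extractions as the same entity when their class and case-normalized text match; that merge rule and the name `merge_passes` are assumptions, since the source does not specify how duplicates are resolved.

```python
def class_and_text(item):
    """Return (class, text) for one {"<class>": text, "<class>_attributes": {...}} item."""
    for key, value in item.items():
        if not key.endswith("_attributes"):
            return key, value
    raise ValueError("extraction has no class key")


def merge_passes(passes):
    """Union the extractions of several runs, keeping the first copy of each entity."""
    merged = {}
    for extractions in passes:
        for item in extractions:
            cls, text = class_and_text(item)
            # Assumed identity rule: same class and same text, ignoring case and padding.
            merged.setdefault((cls, text.strip().lower()), item)
    return list(merged.values())
```

Keeping the first copy is the simplest policy; merging the attribute dictionaries of duplicates would be a reasonable alternative.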
Medical records:
{
  "symptom": "headache",
  "symptom_attributes": {
    "severity": "moderate",
    "duration": "3 days",
    "frequency": "intermittent"
  }
}
Legal documents:
{
  "clause": "liability for breach of contract",
  "clause_attributes": {
    "section": "Article 5",
    "type": "obligatory clause",
    "parties": ["Party A", "Party B"]
  }
}
Social media:
{
  "topic": "#AITechnology",
  "topic_attributes": {
    "sentiment": "positive",
    "engagement": "high"
  }
}
After extraction, map each entity back to source text:
# Find the exact extraction text in the source
position = source_text.find(extraction_text)
if position >= 0:
    # Half-open interval: start inclusive, end exclusive
    char_interval = (position, position + len(extraction_text))
When exact match fails:
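A possible fallback is to slide a window over the source and keep the most similar span. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.8 threshold, the case-insensitive comparison, and the name `align` are assumptions, and the scan is quadratic, so it suits short documents or single chunks.

```python
from difflib import SequenceMatcher


def align(source_text, extraction_text, min_ratio=0.8):
    """Return (start, end) of the extraction in source_text, or None if not found."""
    if not extraction_text:
        return None
    pos = source_text.find(extraction_text)
    if pos >= 0:
        return pos, pos + len(extraction_text)
    n = len(extraction_text)
    if n > len(source_text):
        return None
    # Fuzzy fallback: compare the extraction with every window of the same length.
    needle = extraction_text.lower()
    best_ratio, best_start = 0.0, -1
    for start in range(len(source_text) - n + 1):
        window = source_text[start:start + n].lower()
        ratio = SequenceMatcher(None, needle, window).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_ratio >= min_ratio:
        return best_start, best_start + n
    return None
```

Returning `None` rather than a guessed position keeps ungrounded extractions easy to filter out later.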
Use case: Extract names, places, dates
Example:
{
  "extractions": [
    {"person": "Alice", "person_attributes": {}},
    {"location": "Paris", "location_attributes": {}},
    {"time": "2024-01-15", "time_attributes": {}}
  ]
}
Use case: Detailed entity metadata
Example:
{
  "extractions": [
    {
      "product": "iPhone 15",
      "product_attributes": {
        "category": "smartphone",
        "price": "5999",
        "features": ["A17 chip", "titanium frame"]
      }
    }
  ]
}
Use case: Build entity relationships
Example:
{
  "extractions": [
    {"person": "张三", "person_attributes": {"role": "CEO"}},
    {"person": "李四", "person_attributes": {"role": "CTO"}},
    {
      "relationship": "张三 manages 李四",
      "relationship_attributes": {
        "type": "reports_to",
        "source": "李四",
        "target": "张三"
      }
    }
  ]
}
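Relationship extractions of this shape can be turned into a simple adjacency map. In the sketch below, any extraction whose attributes carry both `source` and `target` is treated as an edge; that rule and the name `build_graph` are assumptions for illustration.

```python
from collections import defaultdict


def build_graph(extractions):
    """Map each source entity to a list of (relation type, target) edges."""
    graph = defaultdict(list)
    for item in extractions:
        for key, value in item.items():
            if not (key.endswith("_attributes") and isinstance(value, dict)):
                continue
            # Assumed rule: attributes with both 'source' and 'target' describe an edge.
            if "source" in value and "target" in value:
                graph[value["source"]].append((value.get("type", ""), value["target"]))
    return dict(graph)
```

For the example above this yields one edge from 李四 to 张三 labeled `reports_to`; plain entity extractions are ignored.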
✅ Do:
❌ Don't:
Good example:
Bad example:
Structured attributes (preferred):
"person_attributes": {
  "age": "30",
  "occupation": "engineer"
}
Unstructured attributes (avoid):
"person_attributes": {
  "info": "30 years old engineer"
}
Extraction results:
============================================================
[Person] (3 items)
------------------------------------------------------------
• 张三
  Description: project manager
  Attributes: {"department": "R&D department"}
[Event] (2 items)
------------------------------------------------------------
• Project kickoff meeting
  Description: discuss Q1 goals
{
  "document_id": "doc_001",
  "text": "original text...",
  "extractions": [
    {
      "extraction_class": "person",
      "extraction_text": "张三",
      "char_interval": {"start": 10, "end": 12},
      "description": "project manager",
      "attributes": {"department": "R&D department"}
    }
  ]
}
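Converting the model's flat extractions into this document record might look like the following sketch. `to_document` is a hypothetical helper; it grounds each extraction with an exact `str.find` only and omits `description`, because the source does not say where that field comes from.

```python
def to_document(document_id, text, extractions):
    """Build a document record with grounded character intervals."""
    records = []
    for item in extractions:
        # The class key is the one that does not end in '_attributes'.
        cls = next(k for k in item if not k.endswith("_attributes"))
        value = item[cls]
        start = text.find(value)
        records.append({
            "extraction_class": cls,
            "extraction_text": value,
            # None when the text cannot be located verbatim in the source.
            "char_interval": (
                {"start": start, "end": start + len(value)} if start >= 0 else None
            ),
            "attributes": item.get(f"{cls}_attributes") or {},
        })
    return {"document_id": document_id, "text": text, "extractions": records}
```

Serialize the result with `json.dumps(..., ensure_ascii=False)` so non-ASCII text stays readable in the output file.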
Generate interactive HTML with:
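As a static starting point for such a page, the sketch below wraps each grounded extraction in a `<mark>` element. It assumes records in the JSON shape shown earlier (`extraction_class`, `char_interval`); the name `highlight` and the choice to skip overlapping spans are illustrative assumptions, and any interactivity would have to be layered on top.

```python
import html


def highlight(text, records):
    """Return text as HTML with each located extraction wrapped in <mark>."""
    spans = sorted(
        (r["char_interval"]["start"], r["char_interval"]["end"], r["extraction_class"])
        for r in records
        if r.get("char_interval")
    )
    out, cursor = [], 0
    for start, end, cls in spans:
        if start < cursor:
            continue  # this simple sketch skips spans that overlap an earlier one
        out.append(html.escape(text[cursor:start]))
        out.append(
            f'<mark title="{html.escape(cls)}">{html.escape(text[start:end])}</mark>'
        )
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)
```

The `title` attribute shows the entity class on hover, and all text is escaped so source documents containing markup render safely.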
See EXAMPLES.md for complete working examples across different domains.