Build and run LLM-powered data processing pipelines with DocETL. Use when users say "docetl", want to analyze unstructured data, process documents, extract information, or run ETL tasks on text. Helps with data collection, pipeline creation, execution, and optimization.
DocETL is a system for creating LLM-powered data processing pipelines. This skill helps you build end-to-end pipelines: from data preparation to execution and optimization.
Work like a data analyst: write → run → inspect → iterate. Never write all the scripts up front and run them in one batch. Complete and validate each phase before moving to the next.
While developing, set sample: 10-20 on operations for testing; once results look good, remove the sample parameter and run the full pipeline.

When producing final reports, pay attention to:
- Visualization aesthetics
- Report structure
- Interactive tables
- Source document links
Key principle: The user should see results at every step. Don't proceed to the next phase until the current phase produces good results.
DocETL datasets must be JSON arrays or CSV files.
JSON array:

```json
[
  {"id": 1, "text": "First document content...", "metadata": "value"},
  {"id": 2, "text": "Second document content...", "metadata": "value"}
]
```

CSV:

```csv
id,text,metadata
1,"First document content...","value"
2,"Second document content...","value"
```
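Either format works; if the data arrives as CSV but you prefer the JSON dataset format, a small conversion script does the job. A minimal sketch (the example.csv filename is just for illustration):

```python
import csv
import json

def csv_to_dataset(csv_path, json_path):
    """Convert a CSV file into the JSON-array dataset format DocETL expects."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # each CSV row becomes one document dict
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return rows

# Demonstration with a tiny hypothetical CSV
with open("example.csv", "w", encoding="utf-8") as f:
    f.write('id,text\n1,"First document content..."\n2,"Second document content..."\n')

docs = csv_to_dataset("example.csv", "example.json")
print(f"Converted {len(docs)} rows")
```

Note that csv.DictReader yields string values; cast fields like id to int if downstream operations expect numbers.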
If the user needs to collect data, write a Python script:

```python
import json

# Collect/transform data (replace `sources` with your actual data source)
documents = []
for source in sources:
    documents.append({
        "id": source.id,
        "text": source.content,  # DO NOT truncate text
        # Add relevant fields
    })

# Save as a DocETL dataset
with open("dataset.json", "w") as f:
    json.dump(documents, f, indent=2)
```
Important: Never truncate document text in collection scripts. DocETL operations like split handle long documents properly. Truncation loses information.
Always run the collection script and inspect results before proceeding. Show the user:
```python
import json

data = json.load(open("dataset.json"))
print(f"Total documents: {len(data)}")
print(f"Keys: {list(data[0].keys())}")
print(f"Avg length: {sum(len(str(d)) for d in data) // len(data)} chars")

# Show a sample document
print("\nSample document:")
print(json.dumps(data[0], indent=2)[:500])
```
Only proceed to pipeline development once the data looks correct.
CRITICAL: Before writing any prompts, READ the actual input data to understand its structure, content, and quirks:
```python
import json

with open("dataset.json") as f:
    data = json.load(f)

# Examine several examples
for doc in data[:5]:
    print(doc)
```
This understanding is essential for writing specific, effective prompts.
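Beyond eyeballing a few examples, it helps to check which fields are actually populated across the whole dataset, since sparse or inconsistent fields are easy to miss in a handful of samples. A sketch using an inline stand-in dataset (in practice, load your dataset.json):

```python
import json

# Tiny stand-in dataset for illustration
data = [
    {"id": 1, "text": "First document...", "author": "A"},
    {"id": 2, "text": "Second document...", "author": ""},
]

# For each key, count how many documents carry a non-empty value
keys = {k for doc in data for k in doc}
coverage = {
    key: sum(1 for doc in data if doc.get(key) not in (None, "", []))
    for key in sorted(keys)
}
for key, filled in coverage.items():
    print(f"{key}: {filled}/{len(data)} documents have a value")
```

A field that is only half-populated usually needs a conditional prompt or a fallback value in the output schema.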
Create a YAML file with this structure:

```yaml
default_model: gpt-5-nano
system_prompt:
  dataset_description: <describe the data based on what you observed>
  persona: <role for the LLM to adopt>
```
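The file then continues with datasets, operations, and pipeline sections. A hedged sketch of how a minimal map pipeline might look — the operation name, prompt, and schema types here are illustrative, so check the exact syntax against the DocETL documentation:

```yaml
datasets:
  docs:
    type: file
    path: dataset.json

operations:
  - name: extract_themes        # illustrative name
    type: map
    prompt: |
      Here is a document:
      {{ input.text }}
      List the main themes it discusses.
    output:
      schema:
        themes: "list[str]"

pipeline:
  steps:
    - name: analysis
      input: docs
      operations:
        - extract_themes
  output:
    type: file
    path: output.json
```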