Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs.
This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.
Read the codebase to understand what your agent processes:
Use the langwatch CLI to fetch the right pages:
langwatch scenario-docs # Index — locate multimodal pages
langwatch scenario-docs multimodal/audio-to-text # Audio testing patterns
langwatch scenario-docs multimodal/multimodal-files # Generic file analysis patterns
langwatch docs # LangWatch docs index
langwatch docs evaluations/experiments/sdk # Experiment SDK basics
langwatch docs evaluations/evaluators/list # Browse evaluator types
For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:
LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation like the following:
import langwatch

experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    result = my_agent(image=entry["image_path"])
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )
Use Scenario's audio testing patterns:
Read the dedicated guide:
langwatch scenario-docs multimodal/audio-to-text
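Alongside an LLM-as-judge verdict, a cheap deterministic transcript check can catch regressions early. A minimal sketch of a word error rate (WER) scorer — this helper is an illustration, not part of the LangWatch or Scenario API:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,      # deletion
                dp[i][j - 1] + 1,      # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quik fox"))  # 0.5
```

You could compute this per entry inside the same experiment loop and report it next to the judge result, so a transcript that drifts shows up as a number even when the judge is lenient.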
Follow the pattern from the PDF parsing evaluation example:
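In the spirit of that notebook, PDF parsing is usually scored field by field: compare the structured output the agent extracted against expected values. A minimal sketch — the field names and the expected/extracted dict shapes are assumptions for illustration:

```python
def field_accuracy(expected: dict, extracted: dict) -> float:
    """Fraction of expected fields the parser got exactly right."""
    if not expected:
        return 1.0
    correct = sum(1 for k, v in expected.items() if extracted.get(k) == v)
    return correct / len(expected)

expected = {"invoice_number": "INV-001", "total": "49.90", "currency": "EUR"}
extracted = {"invoice_number": "INV-001", "total": "49.90", "currency": "USD"}
print(field_accuracy(expected, extracted))  # 2 of 3 fields match
```

Exact-match comparison is deliberately strict; relax it per field (e.g. normalize whitespace or number formats) where the notebook's dataset calls for it.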
For agents that process arbitrary files, read the file analysis guide:
langwatch scenario-docs multimodal/multimodal-files
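When one agent handles several file types, the judge prompt should match the modality. A sketch of extension-based dispatch you could run before each evaluate call — the prompt wordings and extension mapping are assumptions, not prescribed by the guide:

```python
from pathlib import Path

JUDGE_PROMPTS = {
    ".pdf": "Did the agent extract the document's key fields correctly?",
    ".png": "Does the agent correctly describe/classify this image?",
    ".wav": "Is the transcript faithful to the audio content?",
}

def judge_prompt_for(file_path: str) -> str:
    """Pick a modality-appropriate LLM-as-judge prompt, with a generic fallback."""
    return JUDGE_PROMPTS.get(
        Path(file_path).suffix.lower(),
        "Did the agent produce a reasonable analysis of this file?",
    )

print(judge_prompt_for("reports/q3.PDF"))  # matches the .pdf prompt despite case
```

The returned string would go into the `prompt` key of the evaluator `settings`, keeping one loop over a mixed-modality dataset.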
For each modality, generate or collect test data that matches the agent's actual use case:
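One hypothetical shape for such datasets, with keys mirroring the `entry["image_path"]` access in the image example above; the file paths are placeholders, not real fixtures:

```python
# Placeholder per-modality datasets; replace paths with real test files.
image_dataset = [
    {"image_path": "fixtures/cat.png", "expected_label": "cat"},
    {"image_path": "fixtures/chart.png", "expected_label": "bar chart"},
]
audio_dataset = [
    {"audio_path": "fixtures/greeting.wav", "reference_transcript": "hello, how can I help?"},
]
pdf_dataset = [
    {"pdf_path": "fixtures/invoice.pdf", "expected_fields": {"total": "49.90"}},
]

# Sanity-check dataset shapes before spending judge tokens on them.
for entry in image_dataset:
    assert "image_path" in entry and "expected_label" in entry
print(len(image_dataset) + len(audio_dataset) + len(pdf_dataset))  # 4
```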
Run the evaluation, review results, fix issues, re-run until quality is acceptable.
Always read the relevant langwatch scenario-docs ... page for the modality before writing code; multimodal patterns differ significantly from text-only ones.