Track, evaluate, and analyze multi-agent AI workflows using Azure AI Foundry cloud evaluations. Use when logging agent trajectories, computing evaluation metrics, measuring plan adherence, or generating workflow analytics.
| File | Type | Description |
|---|---|---|
| SKILL.md | Documentation | Main skill documentation with 14 evaluation metrics, trajectory format, and decorator usage |
| PRD.md | Documentation | Product Requirements Document for the skill |
| .env.sample | Configuration | Sample environment variables (AZURE_AI_PROJECT_ENDPOINT) |
| requirements.txt | Dependencies | Python package dependencies (azure-ai-projects, azure-identity) |
| scripts/workflow_metrics/ | | |
| scripts/workflow_metrics/__init__.py | Module | Package initializer with exports for all classes and decorators |
| scripts/workflow_metrics/trajectory.py | Logging | WorkflowTrajectoryLog class for multi-step agent trajectory logging in Foundry-compatible format |
| scripts/workflow_metrics/metrics.py | Metrics | TrajectoryMetrics class for calculating metrics from trajectory files |
| scripts/workflow_metrics/foundry_evals.py | Evaluator | FoundryEvalsClient for the Azure AI Foundry cloud evaluation API with 14 evaluation metrics |
| scripts/workflow_metrics/telemetry.py | Decorators | @track_workflow decorator and context functions (log_plan, log_step, auto_log_step) |
| scripts/workflow_metrics/trajectory_manager.py | Manager | EvaluationManager singleton for daily metrics file management |
| scripts/workflow_metrics/tool_definitions_helper.py | Helper | Generates tool definitions from Python functions, with optional LLM enrichment |
| scripts/workflow_metrics/utils.py | Utilities | Utility functions for workflow execution and common operations |
This skill provides cloud-based evaluation of AI agent workflows via Azure AI Foundry Evals API.
CRITICAL: All evaluations (13 LLM-as-judge metrics) MUST run in the Azure AI Foundry cloud. Local SDK-based evaluations are NOT supported. Always use `FoundryEvalsClient` to create cloud eval runs.
The evaluation flow (the package lives under `.github/skills/workflow-metrics/scripts/workflow_metrics/`):
```python
from workflow_metrics import FoundryEvalsClient, WorkflowTrajectoryLog

# 1. Log a trajectory
trajectory = WorkflowTrajectoryLog(
    user_query="What's the status of order W123456?",
    system_prompt="You are a helpful customer service agent."
)
trajectory.log_plan("Check order status", ["get_order_status"])
trajectory.log_step(tool_name="get_order_status", arguments={"order_id": "W123456"}, result={"status": "shipped"})
trajectory.finalize("Your order W123456 has been shipped!")

# 2. Create cloud eval run
client = FoundryEvalsClient()

# Create eval group with metrics
eval_id = client.create_eval(
    name="order_workflow_eval",
    metrics=["coherence", "task_completion", "tool_call_accuracy"]
)

# Submit the workflow for cloud evaluation
workflow_data = trajectory.to_dict()
run_id = client.run_eval(
    eval_id=eval_id,
    items=[client._prepare_workflow_data(workflow_data)],
    run_name="order_eval_run"
)

# Wait for cloud completion and get results
results = client.wait_for_completion(eval_id=eval_id, run_id=run_id)
print(results)
```
All 13 metrics below run in Azure AI Foundry cloud as LLM-as-judge evaluations:
| Metric | Evaluator | Description | Required Fields |
|---|---|---|---|
| coherence | builtin.coherence | Logical flow and consistency | query, response |
| fluency | builtin.fluency | Grammar and readability | response |
| relevance | builtin.relevance | Response relevance to query | query, response |
| groundedness | builtin.groundedness | Factual accuracy based on context | response (context, query, tool_definitions optional) |
| Metric | Evaluator | Description | Required Fields |
|---|---|---|---|
| intent_resolution | builtin.intent_resolution | How well the agent understood intent | query, response |
| task_adherence | builtin.task_adherence | Following task requirements | query, response |
| task_completion | builtin.task_completion | Successfully completing tasks | query, response |
| response_completeness | builtin.response_completeness | Completeness of the response | ground_truth, response |
| Metric | Evaluator | Description | Required Fields |
|---|---|---|---|
| tool_call_accuracy | builtin.tool_call_accuracy | Correct tool selection | query, tool_definitions |
| tool_input_accuracy | builtin.tool_input_accuracy | Correct tool inputs | query, response, tool_definitions |
| tool_output_utilization | builtin.tool_output_utilization | Effective use of tool results | query, response |
| tool_selection | builtin.tool_selection | Appropriate tool choices | query, response, tool_definitions |
| tool_success | builtin.tool_call_success | Tool execution success rate | response |
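The "Required Fields" columns above lend themselves to a pre-submission check. A minimal sketch, assuming the field lists in the tables; the `REQUIRED_FIELDS` mapping and `missing_fields` helper are illustrative and not part of the skill:

```python
# Hypothetical lookup of required item fields per metric, transcribed
# from the metric tables above (illustrative; not part of the skill).
REQUIRED_FIELDS = {
    "coherence": ["query", "response"],
    "fluency": ["response"],
    "relevance": ["query", "response"],
    "groundedness": ["response"],
    "intent_resolution": ["query", "response"],
    "task_adherence": ["query", "response"],
    "task_completion": ["query", "response"],
    "response_completeness": ["ground_truth", "response"],
    "tool_call_accuracy": ["query", "tool_definitions"],
    "tool_input_accuracy": ["query", "response", "tool_definitions"],
    "tool_output_utilization": ["query", "response"],
    "tool_selection": ["query", "response", "tool_definitions"],
    "tool_success": ["response"],
}

def missing_fields(item: dict, metrics: list[str]) -> set[str]:
    """Return the fields the selected metrics need that the item lacks."""
    needed = {f for m in metrics for f in REQUIRED_FIELDS[m]}
    return needed - item.keys()
```

Running this before `run_eval` surfaces schema problems locally instead of as a failed cloud run.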
The FoundryEvalsClient class in foundry_evals.py is the primary interface for cloud evaluations.
```python
from workflow_metrics import FoundryEvalsClient

client = FoundryEvalsClient(
    endpoint=None,              # Uses AZURE_AI_PROJECT_ENDPOINT env var
    deployment_name=None,       # Uses AZURE_AI_MODEL_DEPLOYMENT_NAME env var (default: "gpt-4.1")
    use_reasoning_model=False,  # Set True for o-series models
    threshold=3.0,              # Score threshold for pass/fail (1-5 scale)
    matching_mode="in_order_match"  # For task_navigation_efficiency
)
```
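Both environment variables can be set before constructing the client. A sketch; the endpoint value is a placeholder (the URL shape is an assumption about a typical Foundry project endpoint, so substitute your own from .env.sample):

```python
import os

# Environment variables read by FoundryEvalsClient.
# The endpoint below is a placeholder. Replace it with your project's endpoint.
os.environ["AZURE_AI_PROJECT_ENDPOINT"] = "https://<your-resource>.services.ai.azure.com/api/projects/<your-project>"
os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"] = "gpt-4.1"  # optional; "gpt-4.1" is the default
```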
`create_eval(name, metrics) -> str`

Creates an evaluation group with the specified metrics (testing criteria).
```python
eval_id = client.create_eval(
    name="my_agent_eval",
    metrics=["coherence", "task_completion", "tool_call_accuracy"]
)
# Returns: "eval_abc123..."
```
What happens internally:
```python
# Builds data_source_config schema based on required fields
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {...},  # Fields needed by selected metrics
        "required": [...]
    },
    "include_sample_schema": True
}

# Builds testing_criteria for each metric
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "coherence",
        "evaluator_name": "builtin.coherence",
        "initialization_parameters": {
            "deployment_name": "gpt-4.1"
        },
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}"
        }
    },
    # ... more criteria
]

eval_obj = client.evals.create(
    name=name,
    data_source_config=data_source_config,
    testing_criteria=testing_criteria
)
```
`run_eval(eval_id, items, run_name, metadata) -> str`

Submits items for cloud evaluation using inline JSONL data.
```python
run_id = client.run_eval(
    eval_id=eval_id,
    items=prepared_items,  # List of workflow data dicts
    run_name="my_run_001",
    metadata={"source": "production", "version": "1.0"}
)
# Returns: "run_xyz789..."
```
Critical: The API call structure:
```python
from datetime import datetime

from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileContent,
    SourceFileContentContent,
)

# Build content items (filter out internal tracking fields)
content_items = []
for item in items:
    eval_item = {k: v for k, v in item.items() if not k.startswith("_")}
    content_items.append(SourceFileContentContent(item=eval_item))

# Create the eval run with inline JSONL data
eval_run = client.evals.runs.create(
    eval_id=eval_id,
    name=run_name or f"run_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    metadata=metadata or {"source": "FoundryEvalsClient"},
    data_source=CreateEvalJSONLRunDataSourceParam(
        type="jsonl",
        source=SourceFileContent(
            type="file_content",   # Inline content, not a file reference
            content=content_items, # List of SourceFileContentContent
        ),
    ),
)
```
`wait_for_completion(eval_id, run_id, timeout, poll_interval) -> Dict`

Polls until the evaluation completes, then retrieves the results.
```python
result = client.wait_for_completion(
    eval_id=eval_id,
    run_id=run_id,
    timeout=300,     # Max wait in seconds
    poll_interval=5  # Seconds between polls
)
# Returns: {
#   "status": "completed" | "failed" | "timeout",
#   "output_items": [...],
#   "report_url": "https://...",
#   "eval_id": "...",
#   "run_id": "..."
# }
```
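Under the hood this is a simple poll loop. A minimal illustrative sketch of the pattern, not the skill's actual implementation (`fetch_status` stands in for whatever call retrieves the run status):

```python
import time

def poll_until_done(fetch_status, timeout: float = 300, poll_interval: float = 5) -> str:
    """Illustrative poll loop: fetch_status() returns the current status string.

    Returns the terminal status ("completed" or "failed"), or "timeout"
    if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    return "timeout"
```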
`evaluate_workflows(workflows, metrics, batch_size, timeout) -> Dict`

High-level method that handles the full evaluation flow.
```python
results = client.evaluate_workflows(
    workflows=workflow_list,  # List of workflow dicts with foundry_format
    metrics=["coherence", "task_completion"],
    batch_size=50,  # Max workflows per eval run
    timeout=300     # Max wait per batch
)
# Returns: {
#   "workflow_run_id_1": {
#     "coherence": {"score": 4.0, "pass": True, "reason": "..."},
#     "task_completion": {"score": 5.0, "pass": True, "reason": "..."}
#   },
#   ...
# }
```
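The nested dict returned above aggregates naturally into per-metric statistics. A sketch that assumes only the result shape shown (the `summarize` helper is illustrative, not part of the skill):

```python
def summarize(results: dict) -> dict:
    """Aggregate {workflow: {metric: {score, pass, ...}}} into
    mean score and pass rate per metric."""
    totals: dict[str, list] = {}
    for metric_results in results.values():
        for metric, r in metric_results.items():
            totals.setdefault(metric, []).append((r["score"], r["pass"]))
    return {
        m: {
            "mean_score": sum(s for s, _ in rows) / len(rows),
            "pass_rate": sum(1 for _, p in rows if p) / len(rows),
        }
        for m, rows in totals.items()
    }
```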
`evaluate_single(workflow, metrics, timeout) -> Dict`

Evaluates a single workflow.
```python
result = client.evaluate_single(
    workflow=workflow_dict,
    metrics=["coherence", "tool_call_accuracy"],
    timeout=60
)
```
Workflows must have a foundry_format section:
```python
workflow = {
    "workflow_id": "wf_001",
    "workflow_run_id": "run_001",
    "user_request": "Check order status",
    "foundry_format": {
        "query": "What is the status of order W123?",
        "response": [
            {"role": "assistant", "content": "Let me check that for you."},
            {"role": "assistant", "content": [
                {
                    "type": "tool_call",
                    "tool_call_id": "tc_1",
                    "name": "get_order_status",
                    "arguments": {"order_id": "W123"}
                }
            ]},
            {"role": "tool", "content": [
                {
                    "type": "tool_result",
                    "tool_call_id": "tc_1",
                    "tool_result": {"status": "shipped", "tracking": "1Z999"}
                }
            ]},
            {"role": "assistant", "content": "Your order W123 has been shipped! Tracking: 1Z999"}
        ],
        "tool_definitions": [
            {
                "name": "get_order_status",
                "description": "Get the status of an order",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {"type": "string", "description": "Order ID"}
                    },
                    "required": ["order_id"]
                }
            }
        ],
        "ground_truth": ["get_order_status"]  # Expected tool sequence
    }
}
```
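A quick structural sanity check before submission can catch malformed workflows early. A hypothetical validator for the shape above (the `validate_workflow` helper is illustrative, not part of the skill):

```python
def validate_workflow(workflow: dict) -> list[str]:
    """Return a list of problems with the workflow dict; empty means OK."""
    problems = []
    for key in ("workflow_id", "workflow_run_id", "foundry_format"):
        if key not in workflow:
            problems.append(f"missing top-level key: {key}")
    ff = workflow.get("foundry_format", {})
    if not ff.get("query"):
        problems.append("foundry_format.query is required")
    if not isinstance(ff.get("response"), list):
        problems.append("foundry_format.response must be a list of messages")
    return problems
```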
```python
from datetime import datetime

from workflow_metrics import FoundryEvalsClient

client = FoundryEvalsClient()

# Select metrics to evaluate
metrics = [
    "coherence",           # Text quality
    "task_completion",     # Did the agent complete the task?
    "tool_call_accuracy",  # Did the agent call the right tools?
    "tool_input_accuracy"  # Were the tool inputs correct?
]

eval_id = client.create_eval(
    name=f"workflow_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    metrics=metrics
)
print(f"Created eval group: {eval_id}")

# Prepare items for the API (workflows is a list of workflow dicts with foundry_format)
prepared_items = [client._prepare_workflow_data(w) for w in workflows]

# Submit to Azure AI Foundry cloud
run_id = client.run_eval(
    eval_id=eval_id,
    items=prepared_items,
    run_name="batch_1",
    metadata={"environment": "production", "version": "2.0"}
)
print(f"Submitted eval run: {run_id}")

# Poll until completion (runs in the Azure cloud)
completion = client.wait_for_completion(
    eval_id=eval_id,
    run_id=run_id,
    timeout=300,
    poll_interval=5
)

if completion["status"] == "completed":
    print(f"Eval completed! Report: {completion['report_url']}")

    # Parse results
    results = client._parse_results(
        completion["output_items"],
        prepared_items,
        metrics
    )
    for workflow_id, metric_results in results.items():
        print(f"\nWorkflow: {workflow_id}")
        for metric, result in metric_results.items():
            print(f"  {metric}: {result['score']:.1f} ({'PASS' if result['pass'] else 'FAIL'})")
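For the workflow-analytics use case, the parsed results also flatten cleanly into tabular form. A sketch that writes them out as CSV (the `results_to_csv` helper and its column names are illustrative, not part of the skill):

```python
import csv
import io

def results_to_csv(results: dict) -> str:
    """Flatten {workflow_id: {metric: {score, pass, ...}}} into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["workflow_id", "metric", "score", "pass"])
    for wf_id, metric_results in results.items():
        for metric, r in metric_results.items():
            writer.writerow([wf_id, metric, r["score"], r["pass"]])
    return buf.getvalue()
```

The resulting text can be written to a file or loaded into a dataframe for trend analysis across eval runs.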