Track, evaluate, and analyze multi-agent AI workflows using Azure AI Foundry cloud evaluations. Use when logging agent trajectories, computing evaluation metrics, measuring plan adherence, or generating workflow analytics.
| File | Type | Description |
|---|---|---|
| SKILL.md | Documentation | Main skill documentation with 14 evaluation metrics, trajectory format, and decorator usage |
| PRD.md | Documentation | Product Requirements Document for the skill |
| .env.sample | Configuration | Sample environment variables (AZURE_AI_PROJECT_ENDPOINT) |
| requirements.txt | Dependencies | Python package dependencies (azure-ai-projects, azure-identity) |
| scripts/workflow_metrics/ | | |
| scripts/workflow_metrics/__init__.py | Module | Package initializer with exports for all classes and decorators |
| scripts/workflow_metrics/trajectory.py | Logging | WorkflowTrajectoryLog class for multi-step agent trajectory logging in Foundry-compatible format |
| scripts/workflow_metrics/metrics.py | Metrics | TrajectoryMetrics class for calculating metrics from trajectory files |
| scripts/workflow_metrics/foundry_evals.py | Evaluator | FoundryEvalsClient for the Azure AI Foundry cloud evaluation API with 14 evaluation metrics |
| scripts/workflow_metrics/telemetry.py | Decorators | @track_workflow decorator and context functions (log_plan, log_step, auto_log_step) |
| scripts/workflow_metrics/trajectory_manager.py | Manager | EvaluationManager singleton for daily metrics file management |
| scripts/workflow_metrics/tool_definitions_helper.py | Helper | Generates tool definitions from Python functions, with optional LLM enrichment |
| scripts/workflow_metrics/utils.py | Utilities | Utility functions for workflow execution and common operations |
This skill provides cloud-based evaluation of AI agent workflows via Azure AI Foundry Evals API.
CRITICAL: All evaluations (13 LLM-as-judge metrics) MUST run in the Azure AI Foundry cloud. Local SDK-based evaluations are NOT supported. Always use `FoundryEvalsClient` to create cloud eval runs.
The evaluation flow (the package lives under `.github/skills/workflow-metrics/scripts/workflow_metrics/`):
```python
from workflow_metrics import FoundryEvalsClient, WorkflowTrajectoryLog

# 1. Log a trajectory
trajectory = WorkflowTrajectoryLog(
    user_query="What's the status of order W123456?",
    system_prompt="You are a helpful customer service agent."
)
trajectory.log_plan("Check order status", ["get_order_status"])
trajectory.log_step(tool_name="get_order_status", arguments={"order_id": "W123456"}, result={"status": "shipped"})
trajectory.finalize("Your order W123456 has been shipped!")

# 2. Create cloud eval run
client = FoundryEvalsClient()

# Create eval group with metrics
eval_id = client.create_eval(
    name="order_workflow_eval",
    metrics=["coherence", "task_completion", "tool_call_accuracy"]
)

# Submit the workflow for cloud evaluation
workflow_data = trajectory.to_dict()
run_id = client.run_eval(
    eval_id=eval_id,
    items=[client._prepare_workflow_data(workflow_data)],
    run_name="order_eval_run"
)

# Wait for cloud completion and get results
results = client.wait_for_completion(eval_id=eval_id, run_id=run_id)
print(results)
```
All 13 metrics below run in Azure AI Foundry cloud as LLM-as-judge evaluations:
| Metric | Evaluator | Description | Required Fields |
|---|---|---|---|
| coherence | builtin.coherence | Logical flow and consistency | query, response |
| fluency | builtin.fluency | Grammar and readability | response |
| relevance | builtin.relevance | Response relevance to query | query, response |
| groundedness | builtin.groundedness | Factual accuracy based on context | response (context, query, tool_definitions optional) |
| Metric | Evaluator | Description | Required Fields |
|---|---|---|---|
| intent_resolution | builtin.intent_resolution | How well the agent understood intent | query, response |
| task_adherence | builtin.task_adherence | Following task requirements | query, response |
| task_completion | builtin.task_completion | Successfully completing tasks | query, response |
| response_completeness | builtin.response_completeness | Completeness of the response | ground_truth, response |
| Metric | Evaluator | Description | Required Fields |
|---|---|---|---|
| tool_call_accuracy | builtin.tool_call_accuracy | Correct tool selection | query, tool_definitions |
| tool_input_accuracy | builtin.tool_input_accuracy | Correct tool inputs | query, response, tool_definitions |
| tool_output_utilization | builtin.tool_output_utilization | Effective use of tool results | query, response |
| tool_selection | builtin.tool_selection | Appropriate tool choices | query, response, tool_definitions |
| tool_success | builtin.tool_call_success | Tool execution success rate | response |
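The "Required Fields" columns above lend themselves to a pre-submission check. A minimal sketch, assuming the field lists in the tables; the `REQUIRED_FIELDS` mapping and `missing_fields` helper are illustrative and not part of the skill:

```python
# Hypothetical lookup of required item fields per metric, transcribed
# from the metric tables above (illustrative; not part of the skill).
REQUIRED_FIELDS = {
    "coherence": ["query", "response"],
    "fluency": ["response"],
    "relevance": ["query", "response"],
    "groundedness": ["response"],
    "intent_resolution": ["query", "response"],
    "task_adherence": ["query", "response"],
    "task_completion": ["query", "response"],
    "response_completeness": ["ground_truth", "response"],
    "tool_call_accuracy": ["query", "tool_definitions"],
    "tool_input_accuracy": ["query", "response", "tool_definitions"],
    "tool_output_utilization": ["query", "response"],
    "tool_selection": ["query", "response", "tool_definitions"],
    "tool_success": ["response"],
}

def missing_fields(item: dict, metrics: list[str]) -> set[str]:
    """Return the fields the selected metrics need that the item lacks."""
    needed = {f for m in metrics for f in REQUIRED_FIELDS[m]}
    return needed - item.keys()
```

Running this before `run_eval` surfaces schema problems locally instead of as a failed cloud run.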
The FoundryEvalsClient class in foundry_evals.py is the primary interface for cloud evaluations.
```python
from workflow_metrics import FoundryEvalsClient

client = FoundryEvalsClient(
    endpoint=None,              # Uses AZURE_AI_PROJECT_ENDPOINT env var
    deployment_name=None,       # Uses AZURE_AI_MODEL_DEPLOYMENT_NAME env var (default: "gpt-4.1")
    use_reasoning_model=False,  # Set True for o-series models
    threshold=3.0,              # Score threshold for pass/fail (1-5 scale)
    matching_mode="in_order_match"  # For task_navigation_efficiency
)
```
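Both environment variables can be set before constructing the client. A sketch; the endpoint value is a placeholder (the URL shape is an assumption about a typical Foundry project endpoint, so substitute your own from .env.sample):

```python
import os

# Environment variables read by FoundryEvalsClient.
# The endpoint below is a placeholder. Replace it with your project's endpoint.
os.environ["AZURE_AI_PROJECT_ENDPOINT"] = "https://<your-resource>.services.ai.azure.com/api/projects/<your-project>"
os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"] = "gpt-4.1"  # optional; "gpt-4.1" is the default
```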
`create_eval(name, metrics) -> str`

Creates an evaluation group with the specified metrics (testing criteria).
```python
eval_id = client.create_eval(
    name="my_agent_eval",
    metrics=["coherence", "task_completion", "tool_call_accuracy"]
)
# Returns: "eval_abc123..."
```
What happens internally:
```python
# Builds data_source_config schema based on required fields
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {...},  # Fields needed by selected metrics
        "required": [...]
    },
    "include_sample_schema": True
}

# Builds testing_criteria for each metric
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "coherence",
        "evaluator_name": "builtin.coherence",
        "initialization_parameters": {
            "deployment_name": "gpt-4.1"
        },
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}"
        }
    },
    # ... more criteria
]

eval_obj = client.evals.create(
    name=name,
    data_source_config=data_source_config,
    testing_criteria=testing_criteria
)
```
`run_eval(eval_id, items, run_name, metadata) -> str`

Submits items for cloud evaluation using inline JSONL data.
```python
run_id = client.run_eval(
    eval_id=eval_id,
    items=prepared_items,  # List of workflow data dicts
    run_name="my_run_001",
    metadata={"source": "production", "version": "1.0"}
)
# Returns: "run_xyz789..."
```
Critical: The API call structure:
```python
from datetime import datetime

from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileContent,
    SourceFileContentContent,
)

# Build content items (filter out internal tracking fields)
content_items = []
for item in items:
    eval_item = {k: v for k, v in item.items() if not k.startswith("_")}
    content_items.append(SourceFileContentContent(item=eval_item))

# Create the eval run with inline JSONL data
eval_run = client.evals.runs.create(
    eval_id=eval_id,
    name=run_name or f"run_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    metadata=metadata or {"source": "FoundryEvalsClient"},
    data_source=CreateEvalJSONLRunDataSourceParam(
        type="jsonl",
        source=SourceFileContent(
            type="file_content",   # Inline content, not a file reference
            content=content_items, # List of SourceFileContentContent
        ),
    ),
)
```
`wait_for_completion(eval_id, run_id, timeout, poll_interval) -> Dict`

Polls until the evaluation completes, then retrieves the results.
```python
result = client.wait_for_completion(
    eval_id=eval_id,
    run_id=run_id,
    timeout=300,     # Max wait in seconds
    poll_interval=5  # Seconds between polls
)
# Returns: {
#   "status": "completed" | "failed" | "timeout",
#   "output_items": [...],
#   "report_url": "https://...",
#   "eval_id": "...",
#   "run_id": "..."
# }
```
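Under the hood this is a simple poll loop. A minimal illustrative sketch of the pattern, not the skill's actual implementation (`fetch_status` stands in for whatever call retrieves the run status):

```python
import time

def poll_until_done(fetch_status, timeout: float = 300, poll_interval: float = 5) -> str:
    """Illustrative poll loop: fetch_status() returns the current status string.

    Returns the terminal status ("completed" or "failed"), or "timeout"
    if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    return "timeout"
```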
`evaluate_workflows(workflows, metrics, batch_size, timeout) -> Dict`

High-level method that handles the full evaluation flow.
```python
results = client.evaluate_workflows(
    workflows=workflow_list,  # List of workflow dicts with foundry_format
    metrics=["coherence", "task_completion"],
    batch_size=50,  # Max workflows per eval run
    timeout=300     # Max wait per batch
)
# Returns: {
#   "workflow_run_id_1": {
#     "coherence": {"score": 4.0, "pass": True, "reason": "..."},
#     "task_completion": {"score": 5.0, "pass": True, "reason": "..."}
#   },
#   ...
# }
```
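The nested dict returned above aggregates naturally into per-metric statistics. A sketch that assumes only the result shape shown (the `summarize` helper is illustrative, not part of the skill):

```python
def summarize(results: dict) -> dict:
    """Aggregate {workflow: {metric: {score, pass, ...}}} into
    mean score and pass rate per metric."""
    totals: dict[str, list] = {}
    for metric_results in results.values():
        for metric, r in metric_results.items():
            totals.setdefault(metric, []).append((r["score"], r["pass"]))
    return {
        m: {
            "mean_score": sum(s for s, _ in rows) / len(rows),
            "pass_rate": sum(1 for _, p in rows if p) / len(rows),
        }
        for m, rows in totals.items()
    }
```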
`evaluate_single(workflow, metrics, timeout) -> Dict`

Evaluates a single workflow.
```python
result = client.evaluate_single(
    workflow=workflow_dict,
    metrics=["coherence", "tool_call_accuracy"],
    timeout=60
)
```
Workflows must have a foundry_format section:
```python
workflow = {
    "workflow_id": "wf_001",
    "workflow_run_id": "run_001",
    "user_request": "Check order status",
    "foundry_format": {
        "query": "What is the status of order W123?",
        "response": [
            {"role": "assistant", "content": "Let me check that for you."},
            {"role": "assistant", "content": [
                {
                    "type": "tool_call",
                    "tool_call_id": "tc_1",
                    "name": "get_order_status",
                    "arguments": {"order_id": "W123"}
                }
            ]},
            {"role": "tool", "content": [
                {
                    "type": "tool_result",
                    "tool_call_id": "tc_1",
                    "tool_result": {"status": "shipped", "tracking": "1Z999"}
                }
            ]},
            {"role": "assistant", "content": "Your order W123 has been shipped! Tracking: 1Z999"}
        ],
        "tool_definitions": [
            {
                "name": "get_order_status",
                "description": "Get the status of an order",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "order_id": {"type": "string", "description": "Order ID"}
                    },
                    "required": ["order_id"]
                }
            }
        ],
        "ground_truth": ["get_order_status"]  # Expected tool sequence
    }
}
```
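A quick structural sanity check before submission can catch malformed workflows early. A hypothetical validator for the shape above (the `validate_workflow` helper is illustrative, not part of the skill):

```python
def validate_workflow(workflow: dict) -> list[str]:
    """Return a list of problems with the workflow dict; empty means OK."""
    problems = []
    for key in ("workflow_id", "workflow_run_id", "foundry_format"):
        if key not in workflow:
            problems.append(f"missing top-level key: {key}")
    ff = workflow.get("foundry_format", {})
    if not ff.get("query"):
        problems.append("foundry_format.query is required")
    if not isinstance(ff.get("response"), list):
        problems.append("foundry_format.response must be a list of messages")
    return problems
```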
```python
from datetime import datetime

from workflow_metrics import FoundryEvalsClient

client = FoundryEvalsClient()

# Select metrics to evaluate
metrics = [
    "coherence",           # Text quality
    "task_completion",     # Did the agent complete the task?
    "tool_call_accuracy",  # Did the agent call the right tools?
    "tool_input_accuracy"  # Were the tool inputs correct?
]

eval_id = client.create_eval(
    name=f"workflow_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    metrics=metrics
)
print(f"Created eval group: {eval_id}")

# Prepare items for the API (workflows is a list of workflow dicts with foundry_format)
prepared_items = [client._prepare_workflow_data(w) for w in workflows]

# Submit to Azure AI Foundry cloud
run_id = client.run_eval(
    eval_id=eval_id,
    items=prepared_items,
    run_name="batch_1",
    metadata={"environment": "production", "version": "2.0"}
)
print(f"Submitted eval run: {run_id}")

# Poll until completion (runs in the Azure cloud)
completion = client.wait_for_completion(
    eval_id=eval_id,
    run_id=run_id,
    timeout=300,
    poll_interval=5
)

if completion["status"] == "completed":
    print(f"Eval completed! Report: {completion['report_url']}")

    # Parse results
    results = client._parse_results(
        completion["output_items"],
        prepared_items,
        metrics
    )
    for workflow_id, metric_results in results.items():
        print(f"\nWorkflow: {workflow_id}")
        for metric, result in metric_results.items():
            print(f"  {metric}: {result['score']:.1f} ({'PASS' if result['pass'] else 'FAIL'})")
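For the workflow-analytics use case, the parsed results also flatten cleanly into tabular form. A sketch that writes them out as CSV (the `results_to_csv` helper and its column names are illustrative, not part of the skill):

```python
import csv
import io

def results_to_csv(results: dict) -> str:
    """Flatten {workflow_id: {metric: {score, pass, ...}}} into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["workflow_id", "metric", "score", "pass"])
    for wf_id, metric_results in results.items():
        for metric, r in metric_results.items():
            writer.writerow([wf_id, metric, r["score"], r["pass"]])
    return buf.getvalue()
```

The resulting text can be written to a file or loaded into a dataframe for trend analysis across eval runs.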