Name: LLM Observability Testing Skill
Author: DataDog

LLM Observability Testing Skill

This skill should be used when the user asks to "write LLMObs tests", "add tests for LLM Observability", "test an LLMObs plugin", "llmobs test", "llmobs spec", "test llm observability", "assertLlmObsSpanEvent", "useLlmObs", "getEvents", "MOCK_STRING", "MOCK_NOT_NULLISH", "MOCK_NUMBER", "MOCK_OBJECT", "VCR cassette", "record cassette", "replay cassette", "vcr proxy", "llmobs cassette", "test chat completions", "test streaming", "test embeddings", "test agent runs", "test orchestration", "test workflow", "llmobs span event", "LLMObs test strategy", "LlmObsCategory test", "LLM_CLIENT test", "MULTI_PROVIDER test", "ORCHESTRATION test", "INFRASTRUCTURE test", "span kind llm test", "span kind workflow test", "inputMessages", "outputMessages", "token metrics", "llmobs span validation", "cassette not generated", "re-record cassette", "127.0.0.1:9126", or needs to write, modify, or debug tests for any LLMObs plugin in dd-trace-js.

DataDog · 798 stars · Mar 19, 2026

Occupation
Categories: Lab Tools

⚠️ CRITICAL: Read This First ⚠️

BEFORE writing any test, you MUST determine the package category.

The category determines EVERYTHING:

Whether to use VCR or not
What spanKind to use
What test structure to follow
What examples to study

IF YOU USE THE WRONG CATEGORY STRATEGY, THE TEST WILL FAIL.

Categories are defined in the LlmObsCategory enum.

Quick check:

Does the package make HTTP calls to LLM APIs? → LLM_CLIENT or MULTI_PROVIDER (use VCR)
Does the package orchestrate workflows/graphs? → ORCHESTRATION (NO VCR, pure functions)
Does the package implement protocols/servers? → INFRASTRUCTURE (mock servers)

See references/category-strategies.md for FORBIDDEN vs REQUIRED patterns per category.
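
The quick check above can be sketched as a small decision function. This is an illustrative sketch only: `pickTestStrategy`, the `traits` flags, the returned object shape, and the string values given to `LlmObsCategory` here are all hypothetical names chosen for the example, not the dd-trace-js API. The real enum and the full FORBIDDEN vs REQUIRED rules live in the repository and in references/category-strategies.md.

```javascript
'use strict'

// Stand-in for the real LlmObsCategory enum (assumption: the actual enum
// values in dd-trace-js may differ from these strings).
const LlmObsCategory = Object.freeze({
  LLM_CLIENT: 'LLM_CLIENT',
  MULTI_PROVIDER: 'MULTI_PROVIDER',
  ORCHESTRATION: 'ORCHESTRATION',
  INFRASTRUCTURE: 'INFRASTRUCTURE'
})

// Hypothetical helper: walks the three quick-check questions in order and
// returns the matching test strategy, or null when none of them applies.
function pickTestStrategy (traits) {
  if (traits.callsLlmApisOverHttp) {
    return {
      categories: [LlmObsCategory.LLM_CLIENT, LlmObsCategory.MULTI_PROVIDER],
      useVcr: true,
      approach: 'record and replay HTTP calls with VCR cassettes'
    }
  }
  if (traits.orchestratesWorkflows) {
    return {
      categories: [LlmObsCategory.ORCHESTRATION],
      useVcr: false,
      approach: 'pure functions, no VCR'
    }
  }
  if (traits.implementsProtocolOrServer) {
    return {
      categories: [LlmObsCategory.INFRASTRUCTURE],
      useVcr: false,
      approach: 'mock servers'
    }
  }
  // No question matched: consult references/category-strategies.md instead
  // of guessing a category.
  return null
}

// Usage: a package that only orchestrates workflows must not use VCR.
const strategy = pickTestStrategy({ orchestratesWorkflows: true })
console.log(strategy.categories[0], strategy.useVcr) // ORCHESTRATION false
```

Returning `null` rather than a default category reflects the warning above: a wrong category strategy makes the test fail, so an undecided package should be resolved by reading the reference file first.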

LLM Observability Testing Skill

⚠️ CRITICAL: Read This First ⚠️

Purpose

When to Use

Core Testing Concepts

1. Test Structure

2. VCR Cassettes

3. Category-Specific Test Strategies

LlmObsCategory.LLM_CLIENT & LlmObsCategory.MULTI_PROVIDER

LlmObsCategory.ORCHESTRATION

LlmObsCategory.INFRASTRUCTURE

4. Assertion Patterns

Test File Organization

Key Testing Points

Coverage Requirements

Span Kind Validation

Error Handling

Common Patterns by Category

LLM_CLIENT / MULTI_PROVIDER Pattern

ORCHESTRATION Pattern

INFRASTRUCTURE Pattern

Best Practices

References

Key Principles
