Use Online Evaluations (LLM-as-a-judge) to automatically score AI Config responses for accuracy, relevance, and toxicity.
Three built-in judges are available: accuracy, relevance, and toxicity.
Judges evaluate asynchronously (1-2 minute delay). Results appear in the Monitoring tab.
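For reference, each judge's results are recorded under a metric key following the `$ld:ai:judge:<judge-key>` pattern shown later in this guide. A trivial sketch of that mapping:

```python
# Illustration only: how the three built-in judge keys map to the
# metric keys that appear in the Monitoring tab.
JUDGE_KEYS = ["accuracy", "relevance", "toxicity"]

metric_keys = [f"$ld:ai:judge:{key}" for key in JUDGE_KEYS]
print(metric_keys)
```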
```python
import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai.client import LDAIClient, AICompletionConfigDefault

# Initialize (see aiconfig-sdk)
ldclient.set_config(Config("your-sdk-key"))
ld_client = ldclient.get()
ai_client = LDAIClient(ld_client)


def check_judges(ai_client, config_key: str, user_id: str):
    """Check which judges are attached to a config."""
    context = Context.builder(user_id).build()
    config = ai_client.completion_config(
        config_key,
        context,
        AICompletionConfigDefault(enabled=False),
        {}
    )
    if config.judge_configuration and config.judge_configuration.judges:
        print("[OK] Judges attached:")
        for judge in config.judge_configuration.judges:
            print(f"  - {judge.key}: {int(judge.sampling_rate * 100)}% sampling")
    else:
        print("[INFO] No judges configured")
    return config.judge_configuration
```
For automatic judge evaluation, use the create_chat() method. This handles the full conversation flow and triggers judges automatically.
**Important:** `create_chat()` passes model parameters directly to the provider. LaunchDarkly uses camelCase (`maxTokens`), but OpenAI expects snake_case (`max_tokens`). If your variation has `maxTokens` set, `create_chat()` will fail with OpenAI. Either:

- Omit `maxTokens` from the variation's model parameters, OR
- Use `completion_config()` + `track_openai_metrics()` instead (but judges won't auto-evaluate)
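If you take the `completion_config()` route and call OpenAI yourself, one workaround is to normalize parameter names before passing them to the provider. This is a minimal sketch; the `camel_to_snake` and `normalize_params` helpers are hypothetical, not part of either SDK:

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase parameter name to snake_case (hypothetical helper)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def normalize_params(params: dict) -> dict:
    """Rename LaunchDarkly-style camelCase model parameters for OpenAI."""
    return {camel_to_snake(k): v for k, v in params.items()}

# maxTokens becomes max_tokens before the OpenAI call
print(normalize_params({"maxTokens": 256, "temperature": 0.7}))
```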
```python
from ldai.client import AICompletionConfigDefault, ModelConfig, ProviderConfig, LDMessage


async def generate_with_automatic_evaluation(ai_client, config_key: str, user_id: str, prompt: str):
    """Generate AI response with automatic judge evaluation using create_chat."""
    context = Context.builder(user_id).build()
    chat = await ai_client.create_chat(
        config_key,
        context,
        AICompletionConfigDefault(
            enabled=True,
            model=ModelConfig("gpt-4"),
            provider=ProviderConfig("openai"),
            messages=[LDMessage(role='system', content='You are a helpful assistant.')]
        )
    )
    if not chat:
        return None

    # Invoke chat - judges evaluate automatically (1-2 min delay)
    response = await chat.invoke(prompt)

    # Results appear in Monitoring tab as:
    # $ld:ai:judge:accuracy, $ld:ai:judge:relevance, $ld:ai:judge:toxicity
    return response.message.content
```
Configure sampling rates in the LaunchDarkly UI:
| Environment | Rate | Use Case |
|---|---|---|
| Development | 100% | Full evaluation for testing |
| Staging | 50% | Validation coverage |
| Production (initial) | 10% | Start conservatively |
| Production (stable) | 20% | Ongoing monitoring |
| Critical features | 30% | Important flows |
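To make the rates concrete, here is a sketch (not LaunchDarkly's actual algorithm) of how a per-request sampling decision at a 10% rate might behave, using a deterministic hash of the user key so the same user is sampled consistently:

```python
import hashlib

def should_sample(user_key: str, sampling_rate: float) -> bool:
    """Illustrative sampling decision: hash the user key into [0, 1)
    and compare against the configured rate."""
    digest = hashlib.sha256(user_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate

# At a 0.10 rate, roughly 10% of distinct users are sampled
sampled = sum(should_sample(f"user-{i}", 0.10) for i in range(10_000))
print(sampled)
```

The deterministic hash means a given user is always in (or out of) the sample at a given rate, which keeps a single conversation's evaluations consistent.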
Call `ld_client.flush()` in serverless environments to ensure events are delivered before the runtime suspends.

Related:
- aiconfig-sdk - SDK setup and config retrieval
- aiconfig-ai-metrics - Automatic AI metrics tracking
- aiconfig-variations - Manage variations