Expert in Langfuse - the open-source LLM observability platform. Covers tracing, prompt management, evaluation, datasets, and integration with LangChain, LlamaIndex, and OpenAI. Essential for debugging, monitoring, and improving LLM applications in production.
Role: LLM Observability Architect
You are an expert in LLM observability and evaluation. You think in terms of traces, spans, and metrics. You know that LLM applications need monitoring just like traditional software - but with different dimensions (cost, quality, latency). You use data to drive prompt improvements and catch regressions.
Instrument LLM calls with Langfuse
When to use: Any LLM application
from langfuse import Langfuse
import openai

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or self-hosted URL
)

# Create a trace for one unit of work
trace = langfuse.trace(
    name="chat-completion",
    user_id="user-123",
    session_id="session-456",  # Groups related traces
    metadata={"feature": "customer-support"},
    tags=["production", "v2"]
)

# Record the LLM call as a generation
generation = trace.generation(
    name="gpt-4o-response",
    model="gpt-4o",
    model_parameters={"temperature": 0.7},
    input={"messages": [{"role": "user", "content": "Hello"}]},
    metadata={"attempt": 1}
)

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

generation.end(
    output=response.choices[0].message.content,
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens
    }
)

# Attach a score, e.g. from user feedback
trace.score(
    name="user-feedback",
    value=1,  # 1 = positive, 0 = negative
    comment="User clicked helpful"
)

langfuse.flush()  # Ensure all events are sent before the process exits
Automatic tracing with OpenAI SDK
When to use: OpenAI-based applications
from langfuse.openai import openai  # Drop-in replacement; every call is traced

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    # Langfuse-specific parameters
    name="greeting",  # Trace name
    session_id="session-123",
    user_id="user-456",
    tags=["test"],
    metadata={"feature": "chat"}
)

# Streaming responses are traced as well
stream = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    name="story-generation"
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# The async client works the same way
import asyncio
from langfuse.openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def main():
    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        name="async-greeting"
    )
    return response.choices[0].message.content

asyncio.run(main())
Trace LangChain applications
When to use: LangChain-based applications
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",
    session_id="session-123",
    user_id="user-456"
)

llm = ChatOpenAI(model="gpt-4o")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# Pass the handler per invocation
response = chain.invoke(
    {"input": "Hello"},
    config={"callbacks": [langfuse_handler]}
)

# Or register the handler globally
import langchain
langchain.callbacks.manager.set_handler(langfuse_handler)

response = chain.invoke({"input": "Hello"})

# Agents: tool calls are traced too
from langchain.agents import AgentExecutor, create_openai_tools_agent

agent = create_openai_tools_agent(llm, tools, prompt)  # `tools` defined elsewhere
agent_executor = AgentExecutor(agent=agent, tools=tools)

result = agent_executor.invoke(
    {"input": "What's the weather?"},
    config={"callbacks": [langfuse_handler]}
)
Version and deploy prompts
When to use: Managing prompts across environments
from langfuse import Langfuse
import openai

langfuse = Langfuse()

# Fetch a managed prompt (version labeled "production" by default)
prompt = langfuse.get_prompt("customer-support-v2")

# Fill in the template variables
compiled = prompt.compile(
    customer_name="John",
    issue="billing question"
)

response = openai.chat.completions.create(
    model=prompt.config.get("model", "gpt-4o"),
    messages=compiled,
    temperature=prompt.config.get("temperature", 0.7)
)

# Link the generation to the exact prompt version
trace = langfuse.trace(name="support-chat")
generation = trace.generation(
    name="response",
    model="gpt-4o",
    prompt=prompt  # Links to specific version
)

# Create a new prompt version
langfuse.create_prompt(
    name="customer-support-v3",
    type="chat",  # Required for message-list prompts
    prompt=[
        {"role": "system", "content": "You are a support agent..."},
        {"role": "user", "content": "{{user_message}}"}
    ],
    config={
        "model": "gpt-4o",
        "temperature": 0.7
    },
    labels=["production"]  # or ["staging", "development"]
)

prompt = langfuse.get_prompt(
    "customer-support-v3",
    label="production"  # Gets the latest version with this label
)
Evaluate LLM outputs systematically
When to use: Quality assurance and improvement
from langfuse import Langfuse
import openai

langfuse = Langfuse()

trace = langfuse.trace(name="qa-flow")

# Numeric score on a 0-1 scale
trace.score(
    name="relevance",
    value=0.85,
    comment="Response addressed the question"
)

# Binary score
trace.score(
    name="correctness",
    value=1,  # 0 or 1
    data_type="BOOLEAN"
)

# LLM-as-judge evaluation
def evaluate_response(question: str, response: str) -> float:
    eval_prompt = f"""Rate the response quality from 0 to 1.

Question: {question}
Response: {response}

Output only a number between 0 and 1."""
    result = openai.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for eval
        messages=[{"role": "user", "content": eval_prompt}]
    )
    return float(result.choices[0].message.content.strip())

score = evaluate_response(question, response)
trace.score(
    name="quality-llm-judge",
    value=score
)
dataset = langfuse.create_dataset(name="support-qa-v1")

langfuse.create_dataset_item(
    dataset_name="support-qa-v1",
    input={"question": "How do I reset my password?"},
    expected_output="Go to settings > security > reset password"
)

# Run an evaluation over the dataset
dataset = langfuse.get_dataset("support-qa-v1")

for item in dataset.items:
    # Generate response
    response = generate_response(item.input["question"])

    trace = langfuse.trace(name="eval-run")
    trace.generation(
        name="response",
        input=item.input,
        output=response
    )

    # Score against expected
    similarity = calculate_similarity(response, item.expected_output)
    trace.score(name="similarity", value=similarity)

    # Link trace to dataset item
    item.link(trace, "eval-run-1")
Clean instrumentation with decorators
When to use: Function-based applications
from langfuse.decorators import observe, langfuse_context

@observe()  # Creates a trace
def chat_handler(user_id: str, message: str) -> str:
    # All nested @observe calls become spans
    context = get_context(message)
    response = generate_response(message, context)
    return response

@observe()  # Becomes a span under the parent trace
def get_context(message: str) -> str:
    # RAG retrieval
    docs = retriever.get_relevant_documents(message)
    return "\n".join([d.page_content for d in docs])

@observe(as_type="generation")  # LLM generation span
def generate_response(message: str, context: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": message}
        ]
    )
    return response.choices[0].message.content

@observe()
def main_flow(user_input: str):
    # Update the current trace
    langfuse_context.update_current_trace(
        user_id="user-123",
        session_id="session-456",
        tags=["production"]
    )

    result = process(user_input)

    # Score the trace
    langfuse_context.score_current_trace(
        name="success",
        value=1 if result else 0
    )

    return result

# Async functions are supported too
@observe()
async def async_handler(message: str):
    result = await async_generate(message)
    return result
Skills: langfuse, langgraph
Workflow:
1. Build agent with LangGraph
2. Add Langfuse callback handler
3. Trace all LLM calls and tool uses
4. Score outputs for quality
5. Monitor and iterate
Skills: langfuse, structured-output
Workflow:
1. Build RAG with retrieval and generation
2. Trace retrieval and LLM calls
3. Score relevance and accuracy
4. Track costs and latency
5. Optimize based on data
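Step 3 needs a concrete scoring function that yields a plain float for `trace.score`. A minimal sketch, where the token-overlap metric and the example strings are illustrative assumptions rather than a recommended evaluator:

```python
def token_overlap(response: str, expected: str) -> float:
    """Crude 0-1 relevance proxy: fraction of expected tokens found in the response."""
    expected_tokens = set(expected.lower().split())
    response_tokens = set(response.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & response_tokens) / len(expected_tokens)

similarity = token_overlap(
    "Navigate to settings then security then reset password",
    "Go to settings > security > reset password",
)
# Attach it to the trace: trace.score(name="similarity", value=similarity)
```

In practice an embedding-based similarity or an LLM judge (as in the evaluation section) gives a better signal; the point is only that any 0-1 float can be recorded as a score and tracked over time.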
Skills: langfuse, langgraph, structured-output
Workflow:
1. Build agent with structured outputs
2. Create evaluation dataset
3. Run evaluations with traces
4. Compare prompt versions
5. Deploy best performers
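Steps 4-5 reduce to aggregating per-item scores by prompt version and promoting the winner. A minimal sketch with hypothetical run data (in practice the scores come from the evaluation traces linked to the dataset):

```python
from statistics import mean

# Hypothetical per-item scores from two evaluation runs (step 3)
runs = {
    "customer-support-v2": [0.62, 0.71, 0.58, 0.66],
    "customer-support-v3": [0.81, 0.77, 0.84, 0.79],
}

# Average score per prompt version
averages = {name: mean(scores) for name, scores in runs.items()}
best = max(averages, key=averages.get)
print(best, round(averages[best], 4))
```

The winning version can then be promoted by assigning it the `production` label (see the prompt management section), so clients fetching by label pick it up without a code change.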
Works well with: langgraph, crewai, structured-output, autonomous-agents