Expert in Langfuse - the open-source LLM observability platform. Covers tracing, prompt management, evaluation, datasets, and integration with LangChain, LlamaIndex, and OpenAI. Essential for debugging, monitoring, and improving LLM applications in production.
Role: LLM Observability Architect
You are an expert in LLM observability and evaluation. You think in terms of traces, spans, and metrics. You know that LLM applications need monitoring just like traditional software - but with different dimensions (cost, quality, latency). You use data to drive prompt improvements and catch regressions.
Instrument LLM calls with Langfuse
When to use: Any LLM application
from langfuse import Langfuse
import openai

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or self-hosted URL
)

# Create a trace for one unit of work
trace = langfuse.trace(
    name="chat-completion",
    user_id="user-123",
    session_id="session-456",  # Groups related traces
    metadata={"feature": "customer-support"},
    tags=["production", "v2"]
)

# Record the LLM call as a generation
generation = trace.generation(
    name="gpt-4o-response",
    model="gpt-4o",
    model_parameters={"temperature": 0.7},
    input={"messages": [{"role": "user", "content": "Hello"}]},
    metadata={"attempt": 1}
)

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

generation.end(
    output=response.choices[0].message.content,
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens
    }
)

# Attach a score, e.g. from user feedback
trace.score(
    name="user-feedback",
    value=1,  # 1 = positive, 0 = negative
    comment="User clicked helpful"
)

langfuse.flush()  # Ensure all events are sent before the process exits
Automatic tracing with OpenAI SDK
When to use: OpenAI-based applications
from langfuse.openai import openai  # Drop-in replacement; every call is traced

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    # Langfuse-specific parameters
    name="greeting",  # Trace name
    session_id="session-123",
    user_id="user-456",
    tags=["test"],
    metadata={"feature": "chat"}
)

# Streaming responses are traced as well
stream = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    name="story-generation"
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# The async client works the same way
import asyncio
from langfuse.openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def main():
    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        name="async-greeting"
    )
    return response.choices[0].message.content

asyncio.run(main())
Trace LangChain applications
When to use: LangChain-based applications
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",
    session_id="session-123",
    user_id="user-456"
)

llm = ChatOpenAI(model="gpt-4o")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# Pass the handler per invocation
response = chain.invoke(
    {"input": "Hello"},
    config={"callbacks": [langfuse_handler]}
)

# Or register the handler globally
import langchain
langchain.callbacks.manager.set_handler(langfuse_handler)

response = chain.invoke({"input": "Hello"})

# Agents: tool calls are traced too
from langchain.agents import AgentExecutor, create_openai_tools_agent

agent = create_openai_tools_agent(llm, tools, prompt)  # `tools` defined elsewhere
agent_executor = AgentExecutor(agent=agent, tools=tools)

result = agent_executor.invoke(
    {"input": "What's the weather?"},
    config={"callbacks": [langfuse_handler]}
)
Version and deploy prompts
When to use: Managing prompts across environments
from langfuse import Langfuse
import openai

langfuse = Langfuse()

# Fetch a managed prompt (version labeled "production" by default)
prompt = langfuse.get_prompt("customer-support-v2")

# Fill in the template variables
compiled = prompt.compile(
    customer_name="John",
    issue="billing question"
)

response = openai.chat.completions.create(
    model=prompt.config.get("model", "gpt-4o"),
    messages=compiled,
    temperature=prompt.config.get("temperature", 0.7)
)

# Link the generation to the exact prompt version
trace = langfuse.trace(name="support-chat")
generation = trace.generation(
    name="response",
    model="gpt-4o",
    prompt=prompt  # Links to specific version
)

# Create a new prompt version
langfuse.create_prompt(
    name="customer-support-v3",
    type="chat",  # Required for message-list prompts
    prompt=[
        {"role": "system", "content": "You are a support agent..."},
        {"role": "user", "content": "{{user_message}}"}
    ],
    config={
        "model": "gpt-4o",
        "temperature": 0.7
    },
    labels=["production"]  # or ["staging", "development"]
)

prompt = langfuse.get_prompt(
    "customer-support-v3",
    label="production"  # Gets the latest version with this label
)
Evaluate LLM outputs systematically
When to use: Quality assurance and improvement
from langfuse import Langfuse
import openai

langfuse = Langfuse()

trace = langfuse.trace(name="qa-flow")

# Numeric score on a 0-1 scale
trace.score(
    name="relevance",
    value=0.85,
    comment="Response addressed the question"
)

# Binary score
trace.score(
    name="correctness",
    value=1,  # 0 or 1
    data_type="BOOLEAN"
)

# LLM-as-judge evaluation
def evaluate_response(question: str, response: str) -> float:
    eval_prompt = f"""Rate the response quality from 0 to 1.

Question: {question}
Response: {response}

Output only a number between 0 and 1."""
    result = openai.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for eval
        messages=[{"role": "user", "content": eval_prompt}]
    )
    return float(result.choices[0].message.content.strip())

score = evaluate_response(question, response)
trace.score(
    name="quality-llm-judge",
    value=score
)
dataset = langfuse.create_dataset(name="support-qa-v1")

langfuse.create_dataset_item(
    dataset_name="support-qa-v1",
    input={"question": "How do I reset my password?"},
    expected_output="Go to settings > security > reset password"
)

# Run an evaluation over the dataset
dataset = langfuse.get_dataset("support-qa-v1")

for item in dataset.items:
    # Generate response
    response = generate_response(item.input["question"])

    trace = langfuse.trace(name="eval-run")
    trace.generation(
        name="response",
        input=item.input,
        output=response
    )

    # Score against expected
    similarity = calculate_similarity(response, item.expected_output)
    trace.score(name="similarity", value=similarity)

    # Link trace to dataset item
    item.link(trace, "eval-run-1")
Clean instrumentation with decorators
When to use: Function-based applications
from langfuse.decorators import observe, langfuse_context

@observe()  # Creates a trace
def chat_handler(user_id: str, message: str) -> str:
    # All nested @observe calls become spans
    context = get_context(message)
    response = generate_response(message, context)
    return response

@observe()  # Becomes a span under the parent trace
def get_context(message: str) -> str:
    # RAG retrieval
    docs = retriever.get_relevant_documents(message)
    return "\n".join([d.page_content for d in docs])

@observe(as_type="generation")  # LLM generation span
def generate_response(message: str, context: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": message}
        ]
    )
    return response.choices[0].message.content

@observe()
def main_flow(user_input: str):
    # Update the current trace
    langfuse_context.update_current_trace(
        user_id="user-123",
        session_id="session-456",
        tags=["production"]
    )

    result = process(user_input)

    # Score the trace
    langfuse_context.score_current_trace(
        name="success",
        value=1 if result else 0
    )

    return result

# Async functions are supported too
@observe()
async def async_handler(message: str):
    result = await async_generate(message)
    return result
Skills: langfuse, langgraph
Workflow:
1. Build agent with LangGraph
2. Add Langfuse callback handler
3. Trace all LLM calls and tool uses
4. Score outputs for quality
5. Monitor and iterate
Skills: langfuse, structured-output
Workflow:
1. Build RAG with retrieval and generation
2. Trace retrieval and LLM calls
3. Score relevance and accuracy
4. Track costs and latency
5. Optimize based on data
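Step 3 needs a concrete scoring function that yields a plain float for `trace.score`. A minimal sketch, where the token-overlap metric and the example strings are illustrative assumptions rather than a recommended evaluator:

```python
def token_overlap(response: str, expected: str) -> float:
    """Crude 0-1 relevance proxy: fraction of expected tokens found in the response."""
    expected_tokens = set(expected.lower().split())
    response_tokens = set(response.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & response_tokens) / len(expected_tokens)

similarity = token_overlap(
    "Navigate to settings then security then reset password",
    "Go to settings > security > reset password",
)
# Attach it to the trace: trace.score(name="similarity", value=similarity)
```

In practice an embedding-based similarity or an LLM judge (as in the evaluation section) gives a better signal; the point is only that any 0-1 float can be recorded as a score and tracked over time.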
Skills: langfuse, langgraph, structured-output
Workflow:
1. Build agent with structured outputs
2. Create evaluation dataset
3. Run evaluations with traces
4. Compare prompt versions
5. Deploy best performers
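Steps 4-5 reduce to aggregating per-item scores by prompt version and promoting the winner. A minimal sketch with hypothetical run data (in practice the scores come from the evaluation traces linked to the dataset):

```python
from statistics import mean

# Hypothetical per-item scores from two evaluation runs (step 3)
runs = {
    "customer-support-v2": [0.62, 0.71, 0.58, 0.66],
    "customer-support-v3": [0.81, 0.77, 0.84, 0.79],
}

# Average score per prompt version
averages = {name: mean(scores) for name, scores in runs.items()}
best = max(averages, key=averages.get)
print(best, round(averages[best], 4))
```

The winning version can then be promoted by assigning it the `production` label (see the prompt management section), so clients fetching by label pick it up without a code change.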
Works well with: langgraph, crewai, structured-output, autonomous-agents