Use when designing memory systems for AI agents — tiered memory architecture (in-context, session, long-term, episodic), context window management, memory compression, and retrieval strategies for persistent agent state.
Agents forget everything between sessions by default. Building memory into agents requires deliberate architecture: choosing what to store, where, for how long, and how to retrieve it efficiently without polluting the context window.
Four tiers, each with different storage, latency, and persistence:
| Tier | Storage | Latency | Lifetime | Use case |
|---|---|---|---|---|
| In-context | Token window (4K-200K) | 0ms | Current session | Active task state, recent tool results, current conversation |
| Session | Redis / Postgres | 1-10ms | One conversation | Conversation history, user preferences in session, task progress |
| Long-term | Vector DB + key-value | 10-100ms | Persistent | User facts, learned patterns, past decisions |
| Episodic | DB + vector embeddings | 50-200ms | Persistent | Past task completions, examples, learned workflows |
```python
class TieredMemory:
    def __init__(self):
        self.in_context = []           # current messages
        self.session = SessionStore()  # Redis
        self.long_term = VectorDB()    # Pinecone/Weaviate/pgvector
        self.episodic = EpisodicDB()   # past task completions

    def recall(self, query: str, tiers=("session", "long_term")) -> list[Memory]:
        results = []
        if "session" in tiers:
            results.extend(self.session.get_relevant(query))
        if "long_term" in tiers:
            results.extend(self.long_term.search(query, top_k=5))
        if "episodic" in tiers:
            results.extend(self.episodic.find_similar_tasks(query, top_k=3))
        return deduplicate(results, key="content")
```
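The recall path above relies on a `deduplicate` helper that the snippet leaves undefined. A minimal sketch, assuming each memory is a dict with a `content` field, could be:

```python
def deduplicate(results: list[dict], key: str = "content") -> list[dict]:
    """Drop later entries whose `key` field repeats an earlier one,
    preserving the order in which the tiers returned them."""
    seen = set()
    unique = []
    for item in results:
        value = item.get(key)
        if value not in seen:
            seen.add(value)
            unique.append(item)
    return unique
```

Order matters here: session-tier results are extended first, so when the same fact exists in both session and long-term memory, the fresher session copy wins.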
Retrieve memory before the agent starts working — not mid-task. Front-loading relevant memory prevents mid-loop context changes.
The most common practical problem — context fills up in long conversations:
```python
def manage_context_window(messages: list, max_tokens: int = 6000) -> list:
    """Keep context within limits using priority-based pruning."""
    if len(messages) <= 7:
        return messages  # too short to prune; avoid duplicating the system prompt

    # Always keep: system prompt + last 5 messages + current user message
    must_keep = [messages[0]] + messages[-6:]
    middle = messages[1:-6]
    current_tokens = count_tokens(must_keep)

    if current_tokens < max_tokens:
        # Re-add middle messages, newest first, until we approach the limit
        for msg in reversed(middle):
            msg_tokens = count_tokens([msg])
            if current_tokens + msg_tokens >= max_tokens * 0.85:
                break  # stop rather than skip, so kept messages stay contiguous
            must_keep.insert(1, msg)
            current_tokens += msg_tokens
    else:
        # Compress: summarize the middle section
        if middle:
            summary = summarize(middle)
            must_keep.insert(1, {
                "role": "system",
                "content": f"[Summary of earlier conversation: {summary}]",
            })
    return must_keep
```
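The pruner above assumes two helpers the snippet leaves undefined. Rough stand-ins are sketched below; the character-based estimate and placeholder summarizer are assumptions (in production you would use the model's tokenizer and an LLM summarization call):

```python
def count_tokens(messages: list[dict]) -> int:
    """Crude token estimate: ~4 characters per token is a common rule of thumb."""
    return sum(len(m["content"]) // 4 + 1 for m in messages)

def summarize(messages: list[dict]) -> str:
    """Placeholder summarizer; in production this would be an LLM call."""
    roles = ", ".join(m["role"] for m in messages)
    return f"{len(messages)} earlier messages ({roles}), details elided"
```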
Strategies vary by situation, but the common rule is that not everything should be remembered. Use selective storage:
```python
def should_store_long_term(content: str, agent_output: str) -> bool:
    """Store only information that's useful across sessions."""
    store_triggers = [
        "user mentioned their name",
        "user stated a strong preference",
        "user corrected the agent",
        "user shared context about their role/company",
        "important decision was made",
        "user expressed frustration with agent behavior",
    ]
    # Use an LLM to classify whether the content matches any trigger
    return llm_classify(content, store_triggers)

def store_user_fact(fact: str, user_id: str, confidence: float):
    long_term_db.upsert({
        "user_id": user_id,
        "fact": fact,
        "embedding": embed(fact),
        "confidence": confidence,
        "source": "agent_extraction",
        "created_at": now(),
        "last_accessed": now(),
    })
```
Memory decay: old, unaccessed memories should lose confidence over time, while frequently accessed facts keep or regain it. Implement a cron job that reduces confidence by a small delta each week and prunes entries below a threshold (e.g., confidence < 0.2).
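A minimal sketch of that decay job, operating on facts shaped like the `store_user_fact` records above. The decay delta and access boost are assumed values to tune per product; only the 0.2 prune floor comes from the rule above:

```python
from datetime import datetime, timedelta

DECAY_PER_WEEK = 0.05   # assumed delta per idle week
PRUNE_THRESHOLD = 0.2   # floor from the rule above
ACCESS_BOOST = 0.1      # assumed reward for a recent access

def decay_memories(facts: list[dict], now: datetime) -> list[dict]:
    """Weekly job: decay idle facts, boost recently accessed ones,
    and prune anything below the confidence floor."""
    kept = []
    for fact in facts:
        weeks_idle = (now - fact["last_accessed"]).days / 7
        if weeks_idle < 1:
            fact["confidence"] = min(1.0, fact["confidence"] + ACCESS_BOOST)
        else:
            fact["confidence"] -= DECAY_PER_WEEK * weeks_idle
        if fact["confidence"] >= PRUNE_THRESHOLD:
            kept.append(fact)
    return kept
```

In a real deployment this would run as a batch update against the long-term store rather than over an in-memory list.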
Agents improve by referencing how similar tasks were completed:
```python
def store_completed_task(task_id, task_input, steps_taken, outcome,
                         quality_score, duration_seconds):
    episodic_db.insert({
        "task_id": task_id,
        "input_embedding": embed(task_input),
        "input_summary": summarize(task_input),
        "steps": steps_taken,
        "outcome": outcome,
        "quality_score": quality_score,
        "duration_seconds": duration_seconds,
        "tools_used": [s.tool for s in steps_taken],
    })

def recall_similar_tasks(current_input: str, top_k: int = 3) -> list[Episode]:
    query_embedding = embed(current_input)
    similar = episodic_db.search(query_embedding, top_k=top_k)
    # Use these as few-shot examples in the agent's context
    return similar
```
Only store completed tasks with quality_score above a threshold (e.g., > 0.7). Storing low-quality episodes teaches the agent bad patterns.
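That gate can be a thin wrapper in front of the insert. A sketch against a plain list stand-in for the episodic store (the `maybe_store_episode` name and list-backed store are assumptions for illustration):

```python
QUALITY_THRESHOLD = 0.7  # bar from the rule above

def maybe_store_episode(store: list, episode: dict) -> bool:
    """Append an episode only if it clears the quality bar; returns whether stored."""
    if episode.get("quality_score", 0.0) > QUALITY_THRESHOLD:
        store.append(episode)
        return True
    return False
```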
When multiple agents share memory:
```python
class SharedAgentMemory:
    """Shared memory for multi-agent systems. Wrap writes in a lock (or use
    an atomic store) if agents run concurrently."""

    def write(self, agent_id: str, key: str, value: Any, scope: str = "shared"):
        """scope: 'agent' (private to the writer) or 'shared' (all agents can read)."""
        # Private writes are namespaced by the writing agent's id so that
        # read() below can resolve them per-agent
        prefix = agent_id if scope == "agent" else "shared"
        memory_store.set(
            key=f"{prefix}:{key}",
            value=value,
            metadata={"written_by": agent_id, "timestamp": now()},
        )

    def read(self, agent_id: str, key: str) -> Any:
        # Agents can always read shared scope; private keys resolve
        # only for the agent that wrote them
        return memory_store.get(f"shared:{key}") or \
               memory_store.get(f"{agent_id}:{key}")
```
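The class above assumes an external `memory_store`. A dict-backed stand-in (hypothetical `InMemorySharedMemory`, for illustration only) shows the scoping rules end to end:

```python
from typing import Any

class InMemorySharedMemory:
    """Dict-backed sketch of the shared/private scoping convention."""

    def __init__(self):
        self._store: dict[str, Any] = {}

    def write(self, agent_id: str, key: str, value: Any, scope: str = "shared"):
        # Private writes are namespaced by the writing agent's id
        prefix = agent_id if scope == "agent" else "shared"
        self._store[f"{prefix}:{key}"] = value

    def read(self, agent_id: str, key: str) -> Any:
        # Shared scope is readable by everyone; private scope only by its owner
        if f"shared:{key}" in self._store:
            return self._store[f"shared:{key}"]
        return self._store.get(f"{agent_id}:{key}")
```

Usage: a planner agent publishes its plan to shared scope and keeps scratch notes private; other agents see the plan but not the notes.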
Multi-agent memory patterns: use shared scope for coordination state (plans, intermediate results) and agent scope for private scratch work.
- `agentic-ai-patterns` for understanding where memory fits in the observe-think-act agent loop
- `rag-architecture` for vector search patterns in long-term memory retrieval
- `llm-observability` to track memory hit rates, context window utilization, and retrieval latency
- `agentic-security`: retrieved memories are external data and should be treated as untrusted if user-supplied

@ai-engineer uses this when designing stateful agent systems