Name: Self-Improving AI Skill
Author: jose-compu

Self-Improving AI Skill

Captures learnings about GenAI/LLM configuration, model selection, inference optimization, fine-tuning, RAG pipelines, prompt engineering, multimodal processing, and cost management. Use when: (1) Model response quality degrades after a provider update or version change, (2) Inference latency exceeds acceptable thresholds, (3) Fine-tuned model regresses on evaluation benchmarks, (4) RAG retrieval returns irrelevant or stale chunks, (5) Token costs exceed budget projections, (6) Hallucination rate increases on factual queries, (7) Context window overflows cause critical information truncation, (8) Multimodal pipeline fails on specific input types (image, audio, video, PDF), (9) A better model or configuration is discovered for a task, (10) Guardrails block valid output or miss harmful content.

jose-compu · 0 stars · April 13, 2026

Career
Category: Machine Learning

Log AI/LLM-specific learnings, model issues, and feature requests to markdown files for continuous improvement. Captures model selection insights, prompt optimization patterns, inference tuning, fine-tuning regressions, RAG pipeline improvements, embedding management, multimodal processing failures, evaluation findings, and guardrail adjustments. Important learnings get promoted to model selection matrices, prompt libraries, fine-tuning runbooks, RAG architecture docs, inference optimization checklists, evaluation benchmarks, or guardrail policies.

First-Use Initialisation

Before logging anything, ensure the .learnings/ directory and files exist in the project or workspace root. If any are missing, create them:

mkdir -p .learnings
[ -f .learnings/LEARNINGS.md ] || printf "# AI / LLM Learnings\n\nModel selection insights, prompt optimization patterns, inference tuning, fine-tuning lessons, RAG pipeline improvements, embedding management, multimodal processing, evaluation findings, and guardrail adjustments.\n\n**Categories**: model_selection | prompt_optimization | inference_latency | fine_tune_regression | context_management | modality_gap | hallucination_rate | cost_efficiency\n**Areas**: model_config | prompt_engineering | fine_tuning | rag_pipeline | inference | embeddings | multimodal | evaluation | guardrails\n\n---\n" > .learnings/LEARNINGS.md
[ -f .learnings/MODEL_ISSUES.md ] || printf "# Model Issues Log\n\nInference failures, model regressions, RAG retrieval problems, embedding drift, multimodal pipeline errors, and guardrail misfires.\n\n---\n" > .learnings/MODEL_ISSUES.md
[ -f .learnings/FEATURE_REQUESTS.md ] || printf "# Feature Requests\n\nCapabilities needed for model selection, inference optimization, fine-tuning, RAG pipelines, multimodal processing, and evaluation.\n\n---\n" > .learnings/FEATURE_REQUESTS.md
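A quick way to confirm the initialisation took effect is to check for the three files. The following is a minimal sketch, not part of the skill itself; the function name `check_learnings` is hypothetical, and it assumes the default `.learnings` directory unless another path is passed.

```shell
# Hypothetical helper: report which of the three log files exist.
# Returns the number of missing files (0 means everything is in place).
check_learnings() {
  dir="${1:-.learnings}"
  missing=0
  for f in LEARNINGS.md MODEL_ISSUES.md FEATURE_REQUESTS.md; do
    if [ -f "$dir/$f" ]; then
      # wc -l pads its output on some systems, so strip the spaces
      printf 'ok       %s (%s lines)\n' "$dir/$f" "$(wc -l < "$dir/$f" | tr -d ' ')"
    else
      printf 'missing  %s\n' "$dir/$f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}
```

Running `check_learnings` from the project root after the commands above should list all three files as `ok`.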

| Situation | Action |
| --- | --- |
| Model quality drops after provider update | Log to `.learnings/MODEL_ISSUES.md` with model version details |
| Latency spike on inference | Log to `.learnings/MODEL_ISSUES.md` with latency measurement |
| Fine-tuned model regresses on eval | Log to `.learnings/LEARNINGS.md` with `fine_tune_regression` |
| RAG returns wrong or stale chunks | Log to `.learnings/MODEL_ISSUES.md` with retrieval details |
| Token cost over budget | Log to `.learnings/LEARNINGS.md` with `cost_efficiency` |
| Hallucination detected | Log to `.learnings/LEARNINGS.md` with `hallucination_rate` |
| Context window overflow | Log to `.learnings/LEARNINGS.md` with `context_management` |
| Better model discovered for task | Log to `.learnings/LEARNINGS.md` with `model_selection` |
| Prompt tweak improves output significantly | Log to `.learnings/LEARNINGS.md` with `prompt_optimization` |
| Multimodal input fails (image/audio/video) | Log to `.learnings/MODEL_ISSUES.md` with modality details |
| Embedding quality degrades | Log to `.learnings/MODEL_ISSUES.md` with similarity metrics |
| Guardrail false positive | Log to `.learnings/LEARNINGS.md` with guardrails note |
| New AI capability needed | Log to `.learnings/FEATURE_REQUESTS.md` |
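The situation-to-file routing in this table can be sketched as a small shell helper. This is an illustration under stated assumptions: the function names `learnings_file_for` and `log_note`, the situation keywords, and the `## heading (date)` note layout are all hypothetical and are not the skill's own entry format.

```shell
# Hypothetical helper: map a situation keyword to the log file from the table.
learnings_file_for() {
  case "$1" in
    quality_drop|latency_spike|rag_retrieval|multimodal_failure|embedding_degradation)
      echo ".learnings/MODEL_ISSUES.md" ;;
    fine_tune_regression|cost_efficiency|hallucination_rate|context_management|model_selection|prompt_optimization|guardrail_false_positive)
      echo ".learnings/LEARNINGS.md" ;;
    feature_request)
      echo ".learnings/FEATURE_REQUESTS.md" ;;
    *)
      echo "unknown situation: $1" >&2
      return 1 ;;
  esac
}

# Hypothetical helper: append a short dated note to the matching file.
# The note layout here is an assumption, not the skill's entry format.
log_note() {
  situation="$1"; shift
  file="$(learnings_file_for "$situation")" || return 1
  mkdir -p "$(dirname "$file")"
  printf '\n## %s (%s)\n\n%s\n' "$situation" "$(date +%Y-%m-%d)" "$*" >> "$file"
}

# Example call (commented out so sourcing this file has no side effects):
# log_note latency_spike "p95 latency rose after the provider update"
```

Keeping the routing in one `case` statement means a new situation only needs one extra pattern, and unknown keywords fail loudly instead of landing in the wrong file.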

| Learning Type | Promote To | Example |
| --- | --- | --- |
| Model behavior patterns | `SOUL.md` | "Claude 4 tends to over-qualify, use direct prompting" |
| Model selection & routing | `AGENTS.md` | "Use fast model for triage, capable model for code gen" |
| Model/tool configuration | `TOOLS.md` | "Set temperature 0.1 for code, 0.7 for creative" |
| Model selection insights | Model selection matrix | "Sonnet for code gen, Opus for complex reasoning" |
| Prompt patterns that work | Prompt library | "Chain-of-thought improves code quality by 35%" |
| Fine-tuning lessons | Fine-tuning runbook | "Always include replay data to prevent forgetting" |
| RAG improvements | RAG architecture doc | "Chunk by content type, not fixed token size" |
| Inference optimizations | Performance tuning guide | "Cache system prompts, batch similar requests" |
| Evaluation findings | Benchmark suite docs | "HumanEval + internal eval for code gen models" |
| Guardrail tuning | Guardrail policy doc | "Lower toxicity threshold for customer-facing" |

| Category | Use When |
| --- | --- |
| `model_selection` | A different model performs better for a task (quality, cost, or latency) |
| `prompt_optimization` | A prompt change significantly improves output quality or efficiency |
| `inference_latency` | Latency exceeds thresholds or optimization opportunity found |
| `fine_tune_regression` | Fine-tuned model scores below baseline on eval benchmarks |
| `context_management` | Context window overflow, lost-in-the-middle, or prompt structure issue |
| `modality_gap` | Model fails on specific input type (image, audio, video, PDF) |
| `hallucination_rate` | Model produces factually incorrect output on known-fact queries |
| `cost_efficiency` | Token cost exceeds budget or cost optimization opportunity found |
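Because entries are tagged with one of these eight category names, it can help to reject typos before anything is written to a log. The check below is a minimal sketch; `VALID_CATEGORIES` and `is_valid_category` are hypothetical names, and the list simply mirrors the table above.

```shell
# The eight category names from the table, space-separated.
VALID_CATEGORIES="model_selection prompt_optimization inference_latency fine_tune_regression context_management modality_gap hallucination_rate cost_efficiency"

# Hypothetical helper: succeed only if $1 is exactly one of the known categories.
is_valid_category() {
  for c in $VALID_CATEGORIES; do
    [ "$c" = "$1" ] && return 0
  done
  return 1
}

# Example use before logging:
# is_valid_category "$category" || echo "unknown category: $category" >&2
```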

| Target | What Belongs There |
| --- | --- |
| Model selection matrix | Which model for which task, with benchmarks |
| Prompt library | Proven prompt patterns, system prompt templates |
| Fine-tuning runbook | Training procedures, data mixing ratios, eval gates |
| RAG architecture doc | Chunking strategy, embedding models, retrieval ranking |
| Performance tuning guide | Inference caching, batching, quantization, provider routing |
| Benchmark suite docs | Eval methodology, baseline scores, regression thresholds |
| Guardrail policy doc | Content filtering rules, PII detection, output validation |
| `AGENTS.md` | Model routing, multi-agent workflows |

| Priority | When to Use | AI Examples |
| --- | --- | --- |
| `critical` | Model producing harmful/dangerous output, data leaking through model, fine-tuned model catastrophic regression, guardrail completely bypassed | Hallucination in medical/legal advice, PII in model output, 45% drop on coding benchmarks after fine-tune |
| `high` | Significant quality drop, latency >10x baseline, cost overrun >50%, hallucination on critical facts, multimodal pipeline broken | Model fails all rotated PDF scans, 3x latency after provider update, daily cost doubled |
| `medium` | Model selection could be better, prompt optimization opportunity, minor eval regression, cost optimization, embedding refresh needed | Sonnet 23% better than GPT-4o for code gen, chain-of-thought adds 35% quality, embedding dimension reduction saves 60% |
| `low` | Documentation of model behavior, minor prompt tweak, config cleanup | Model prefers bullet lists over paragraphs, temperature 0.1 vs 0.0 negligible difference |
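Two of the `high` criteria are numeric (latency above 10x baseline, cost overrun above 50%), so they can be checked mechanically. The sketch below covers only those two thresholds and leaves every other criterion to human judgment; the function name `is_high_priority` and its argument order are hypothetical, and all inputs are assumed to be whole numbers.

```shell
# Hypothetical helper: integer-only check of the two numeric "high" thresholds.
# Usage: is_high_priority LATENCY_MS BASELINE_MS COST BUDGET
is_high_priority() {
  latency_ms="$1"; baseline_ms="$2"; cost="$3"; budget="$4"
  # Latency strictly more than 10x the baseline
  if [ "$latency_ms" -gt $((baseline_ms * 10)) ]; then
    return 0
  fi
  # Cost overrun above 50%, i.e. cost > 1.5 * budget, written without fractions
  if [ $((cost * 2)) -gt $((budget * 3)) ]; then
    return 0
  fi
  return 1
}

# Example use:
# is_high_priority 5500 500 100 100 && echo "log as high"
```

The comparison is strict, so a cost of exactly 1.5 times the budget does not trigger it.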

| Area | Scope |
| --- | --- |
| `model_config` | Model selection, version pinning, parameter tuning (temperature, top-p, top-k, max tokens, stop sequences), provider configuration, fallback chains |
| `prompt_engineering` | System prompts, few-shot examples, chain-of-thought, prompt templates, prompt compression, prompt caching |
| `fine_tuning` | Training data curation, hyperparameters, eval sets, RLHF/DPO, LoRA/QLoRA, catastrophic forgetting, checkpoint management |
| `rag_pipeline` | Chunking strategy, embedding models, vector stores, retrieval ranking, reranking, hybrid search, context assembly |
| `inference` | Latency optimization, batching, streaming, caching, quantization, speculative decoding, KV cache, provider routing |
| `embeddings` | Model selection, dimension reduction, index management, drift detection, similarity thresholds, multi-language |
| `multimodal` | Vision (image/PDF), audio (speech/music), video, cross-modal retrieval, modality-specific prompting, output format handling |
| `evaluation` | Benchmarks, human eval, automated eval (LLM-as-judge), A/B testing, regression testing, domain-specific metrics |
| `guardrails` | Content filtering, PII detection, toxicity, factuality checks, output validation, structured output enforcement, jailbreak prevention |

| Script | Hook Type | Purpose |
| --- | --- | --- |
| `scripts/activator.sh` | UserPromptSubmit | Reminds to evaluate AI/model learnings after tasks |
| `scripts/error-detector.sh` | PostToolUse (Bash) | Triggers on model API errors, rate limits, inference failures |
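To illustrate the kind of check an error-detection hook performs, here is a rough sketch that scans text for common API failure phrases. It is not the contents of `scripts/error-detector.sh`: the function name `detect_model_error`, the pattern list, and the assumption that the text to scan arrives on stdin are all hypothetical.

```shell
# Hypothetical sketch: scan text on stdin for signs of a model/API failure.
# Prints a reminder and returns 0 on a match, returns 1 otherwise.
detect_model_error() {
  # Illustrative patterns only; a real hook would tune these to its providers.
  if grep -Eiq 'rate.?limit|too many requests|HTTP 429|context length|timed? ?out|overloaded'; then
    echo "Possible model/API error detected: consider logging to .learnings/MODEL_ISSUES.md"
    return 0
  fi
  return 1
}

# Example use:
# some_command 2>&1 | detect_model_error
```

Keeping the pattern list narrow reduces false reminders on ordinary command output.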

| Criterion | Description |
| --- | --- |
| Recurring | Same model issue or pattern in 2+ projects or task types |
| Verified | Status is `resolved` with confirmed benchmark improvement |
| Non-obvious | Required investigation, benchmarking, or A/B testing |
| Broadly applicable | Not specific to one model version; useful across providers |
| User-flagged | User says "save this as a skill" or similar |
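The "Recurring" criterion (the same pattern in 2+ projects) can be approximated by counting how many projects mention a phrase in their `.learnings/` logs. The sketch below is one possible approach under stated assumptions: `count_projects_with` is a hypothetical name, matching is a case-insensitive literal search, and `grep -r` is assumed to be available (it is in GNU and BSD grep).

```shell
# Hypothetical helper: count project directories whose .learnings/ logs
# mention a literal phrase (case-insensitive).
# Usage: count_projects_with PHRASE PROJECT_DIR...
count_projects_with() {
  pattern="$1"; shift
  count=0
  for project in "$@"; do
    # Projects without a .learnings directory are silently skipped
    if grep -rqiF -- "$pattern" "$project/.learnings" 2>/dev/null; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Example use:
# [ "$(count_projects_with "lost-in-the-middle" ~/proj-a ~/proj-b)" -ge 2 ] && echo "recurring"
```

A phrase match is only a rough signal, so a count of 2 or more is a prompt to review the entries, not proof that they describe the same issue.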

| Agent | Activation | Detection |
| --- | --- | --- |
| Claude Code | Hooks (UserPromptSubmit, PostToolUse) | Automatic via error-detector.sh |
| Codex CLI | Hooks (same pattern) | Automatic via hook scripts |
| GitHub Copilot | Manual (`.github/copilot-instructions.md`) | Manual review |
| OpenClaw | Workspace injection + inter-agent messaging | Via session tools |

Self-Improving AI Skill

First-Use Initialisation

Quick Reference

OpenClaw Setup (Recommended)

Installation

Workspace Structure

Create Learning Files

Promotion Targets

Optional: Enable Hook

Generic Setup (Other Agents)

Add reference to agent files

Self-Improving AI Workflow

Logging Format

Learning Entry [LRN-YYYYMMDD-XXX]

Model Issue Entry [MDL-YYYYMMDD-XXX]

Feature Request Entry [FEAT-YYYYMMDD-XXX]

ID Generation

Resolving Entries

Promoting to Project Memory

When to Promote

Promotion Targets

How to Promote

Recurring Pattern Detection

Simplify & Harden Feed

Periodic Review

When to Review

Quick Status Check

Review Actions

Detection Triggers

Priority Guidelines

Area Tags

Model Lifecycle Management

Deprecation Tracking

Migration Planning

Best Practices

Gitignore Options

Hook Integration

Quick Setup (Claude Code / Codex)

Advanced Setup (With Error Detection)

Available Hook Scripts

Automatic Skill Extraction

Skill Extraction Criteria

Extraction Workflow

Extraction Detection Triggers

Multi-Agent Support

Model Routing for Multi-Agent Setups

Stackability Contract (Standalone + Multi-Skill)

Namespaced Logging (recommended for 2+ skills)

Required Metadata

Hook Arbitration (when 2+ skills are enabled)

Narrow Matcher Scope (ai)

Cross-Skill Precedence

Ownership Rules
