Add new AI models to Kiln's ml_model_list.py and produce a Discord announcement. Use when the user wants to add, integrate, or register a new LLM model (e.g. Claude, GPT, DeepSeek, Gemini, Kimi, Qwen, Grok) into the Kiln model list, mentions adding a model to ml_model_list.py, or asks to discover/find new models that are available but not yet in Kiln.
Integrating a new model into libs/core/kiln_ai/adapters/ml_model_list.py requires:
- `ModelName` enum – add an enum member
- `built_in_models` list – add a `KilnModel(...)` entry with providers
- `ModelFamily` enum – only if the vendor is brand-new

After code changes, run paid integration tests, then draft a Discord post.
These apply throughout the entire workflow.
model_id must come from an authoritative source (LiteLLM catalog, official docs, API reference, or changelog). If you can't verify a slug, tell the user and ask them to provide it.

If the user asks you to find new models, do NOT just web search "new AI models this week" — that only surfaces major releases. Instead, systematically check each family against both the LiteLLM catalog and models.dev, then union the results. Both are attempts to catalog available models and each has gaps the other fills.
Read the ModelFamily and ModelName enums to know what we already have.
Query both catalogs for each family (run in parallel where possible):
LiteLLM catalog — filters out mirror providers to avoid duplicates:
curl -s 'https://api.litellm.ai/model_catalog?model=SEARCH_TERM&mode=chat&page_size=500' -H 'accept: application/json' | jq '[.data[] | select(.provider != "openrouter" and .provider != "bedrock" and .provider != "bedrock_converse" and .provider != "vertex_ai-anthropic_models" and .provider != "azure") | .id] | unique | .[]'
models.dev — search all model IDs across all providers:
curl -s https://models.dev/api.json | jq '[to_entries[].value.models // {} | keys[]] | .[]' | grep -i "SEARCH_TERM"
For details on a specific provider+model: curl -s https://models.dev/api.json | jq '.["PROVIDER"].models["MODEL_ID"]'
Search terms (one query per term):
claude, gpt, o1, o3, o4 (OpenAI reasoning), gemini, llama, deepseek, qwen, qwq, mistral, grok, kimi, glm, minimax, hunyuan, ernie, phi, gemma, seed, step, pangu
Union and cross-reference results from both catalogs against ModelName. A model found in either source counts as available. Focus on direct-provider entries (not OpenRouter/Bedrock/Azure mirrors). Skip pure coding models (e.g. codestral, deepseek-coder, qwen-coder).
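The union-and-diff step above can be sketched mechanically. A hedged illustration — the file names and model IDs below are invented stand-ins for the catalog query results and the existing `ModelName` values:

```shell
# IDs pulled from the LiteLLM catalog (illustrative):
cat > litellm_ids.txt <<'EOF'
claude-opus-4-5
kimi-k2
EOF
# IDs pulled from models.dev (illustrative):
cat > modelsdev_ids.txt <<'EOF'
glm-5
kimi-k2
EOF
# IDs already present in Kiln's ModelName enum (illustrative):
cat > existing_kiln_ids.txt <<'EOF'
claude-opus-4-5
EOF
# Union both catalogs, then drop anything Kiln already has:
sort -u litellm_ids.txt modelsdev_ids.txt > union.txt
sort existing_kiln_ids.txt > existing_sorted.txt
comm -23 union.txt existing_sorted.txt
```

`comm -23` prints lines unique to the first (sorted) input, which here is exactly the "available but not yet in Kiln" candidate list to present to the user.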
Run targeted web searches per family to catch very fresh releases not yet in either catalog:
- "[family] new model [current year]"
- "[family] release [current month] [current year]"

Present findings as a summary. Let the user decide which to add.
Some providers — Fireworks AI, Together AI, SiliconFlow — expose new models on their own endpoints 1–2 weeks before those entries surface in models.dev / LiteLLM. Relying only on those two catalogs will both under-populate the provider list for the model you're adding now and miss the window to backfill recently-added models whose provider support has since grown.
Run this check on every invocation of the skill, regardless of whether you're in discovery mode or adding a specific model.
Pull the 10 most recently added models from the top of built_in_models in ml_model_list.py (newest are at the top), or from git:
git log --follow -p -- libs/core/kiln_ai/adapters/ml_model_list.py | grep -E "^\+\s+name=ModelName\." | head -20
For the model you're adding (if any) AND each of those 10 models, cross-check Fireworks, Together, and SiliconFlow directly using the endpoints in the Lagging Providers Reference. Do NOT trust models.dev / LiteLLM as the final word for these three providers.
If a lagging provider now supports a recently-added model that isn't yet in its KilnModel entry, flag it to the user and propose either bundling the provider addition into the current change or opening a separate PR. Do not silently add it.
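The backfill comparison reduces to a set difference. A hedged offline sketch — `together_catalog.txt` stands in for IDs fetched from the provider's endpoint and `kiln_ids.txt` for the `model_id` values Kiln already lists for that provider (contents invented):

```shell
# Slugs the provider currently serves (illustrative):
cat > together_catalog.txt <<'EOF'
moonshotai/kimi-k2
zai-org/glm-5
EOF
# Slugs Kiln already has for this provider (illustrative):
cat > kiln_ids.txt <<'EOF'
moonshotai/kimi-k2
EOF
# Catalog entries missing from Kiln = backfill candidates to flag to the user:
grep -Fxv -f kiln_ids.txt together_catalog.txt
```

`grep -Fxv -f` treats `kiln_ids.txt` as fixed whole-line patterns and prints only catalog lines that match none of them.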
Read the predecessor model in ml_model_list.py (e.g. for Opus 4.6 → read Opus 4.5). You inherit most parameters from it.
Query the LiteLLM catalog for the new model. This is the primary slug source since Kiln uses LiteLLM. See the Slug Lookup Reference for query syntax and all verified sources.
Get the OpenRouter slug via:

- `curl -s https://openrouter.ai/api/v1/models | jq '.data[].id' | grep -i "SEARCH_TERM"`
- Web search: `openrouter [model name] model id`

Get the direct-provider slug (Anthropic, OpenAI, Google, etc.). Use the LiteLLM catalog first, then official docs. See the Slug Lookup Reference for provider-specific URLs.
Identify quirks — check the Provider Quirks Reference for the relevant provider, and web search for any new quirks:
- Reasoning behavior (`reasoning_capable`, parsers, OpenRouter options)?
- Sampling restrictions (`temp_top_p_exclusive`, etc.)?
- Rate limits (`max_parallel_requests`)?

Determine thinking levels — does the model support configurable reasoning effort? See Thinking Levels Reference for the full lookup chain. Key quick checks:
- OpenRouter `supported_parameters` — if `reasoning` is absent, skip thinking levels

All changes go in libs/core/kiln_ai/adapters/ml_model_list.py.
- `ModelName` enum — e.g. `claude_opus_4_6 = "claude_opus_4_6"`
- `KilnModel` entry in `built_in_models` — `name`, `friendly_name`, `model_id` per provider, flags

`friendly_name` must follow the existing naming pattern of sibling models in the same family. Check the predecessor. For example, Claude Sonnets use "Claude {version} Sonnet" (e.g. "Claude 4.5 Sonnet"), not "Claude Sonnet {version}". Do NOT use the vendor's marketing name if it differs from Kiln's established convention.

Provider `model_id` formats:
| Provider | Format | Notes |
|---|---|---|
| `openrouter` | `vendor/model-name` | Always verify via API |
| `openai` | Bare model name | Verify via OpenAI docs |
| `anthropic` | Variable — older models have date stamps, newer may not | Always verify via Anthropic docs |
| `gemini_api` | Bare name | Verify via Google AI Studio docs |
| `fireworks_ai` | `accounts/fireworks/models/...` | Verify via Fireworks docs |
| `together_ai` | Vendor path format | Verify via Together docs |
| `vertex` | Usually same as `gemini_api` | Verify via Vertex docs |
| `siliconflow_cn` | Vendor/model format | Verify via SiliconFlow docs |
Every single model_id must be verified from an authoritative source. No exceptions.
Setting flags — use catalog data + predecessor as dual signals:
The LiteLLM catalog and models.dev responses include capability flags (supports_vision, supports_function_calling, supports_reasoning, etc.). Use these as the primary signal for what to enable on the new model:
- `supports_vision: true` → enable `supports_vision`, `multimodal_capable`, and vision MIME types (see 2c)
- `supports_function_calling: true` → use `StructuredOutputMode.json_schema` (or `function_calling` depending on provider norms — check predecessor)
- `supports_reasoning: true` → enable `reasoning_capable` and check if parser/formatter/thinking flags are needed

Then cross-check against the predecessor. The predecessor tells you how Kiln configures a similar model (which structured_output_mode, which provider-specific flags, etc.). The catalog tells you what the model can do. Use both:
- Predecessor has `temp_top_p_exclusive` but nothing in the catalog mentions it? Keep it — it's a provider quirk the catalog doesn't track.

Common flags:
- `structured_output_mode` – how the model handles JSON output
- `suggested_for_evals` / `suggested_for_data_gen` – see zero-sum rule below
- `multimodal_capable` / `supports_vision` / `supports_doc_extraction` – see multimodal rules below
- `reasoning_capable` – for thinking/reasoning models
- `temp_top_p_exclusive` – Anthropic models that can't have both temp and top_p
- `parser` / `formatter` – for models needing special parsing (e.g. R1-style thinking)

If the model supports non-text inputs, configure:
- `multimodal_capable=True` and `supports_doc_extraction=True` if it supports any MIME types
- `supports_vision=True` if it supports images
- `multimodal_requires_pdf_as_image=True` if vision-capable but no native PDF support (also add `KilnMimeType.PDF` to MIME list). Always set this on OpenRouter providers — OpenRouter routes PDFs through Mistral OCR which breaks LiteLLM parsing.
- Always include `KilnMimeType.TXT` and `KilnMimeType.MD` on any `multimodal_capable` model

Strategy: start broad, narrow based on test failures. Enable a generous set of MIME types, run tests, and remove only types the provider explicitly rejects (400 errors). Don't remove types for timeout/auth/content-mismatch failures.
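The capability-to-flag mapping can be sketched mechanically. A hedged illustration — `entry.json` mimics the shape of a LiteLLM catalog record (field names from the catalog queries in this doc; the values and model ID are invented):

```shell
# A fake catalog record for demonstration:
cat > entry.json <<'EOF'
{"id": "example-model",
 "supports_vision": true,
 "supports_function_calling": true,
 "supports_reasoning": false}
EOF
# Print the Kiln flags each capability suggests enabling:
jq -r '
  (if .supports_vision then "supports_vision=True multimodal_capable=True" else empty end),
  (if .supports_function_calling then "structured_output_mode=json_schema (check predecessor)" else empty end),
  (if .supports_reasoning then "reasoning_capable=True" else empty end)
' entry.json
```

This only encodes the catalog side of the dual-signal rule — the predecessor cross-check (provider quirks like `temp_top_p_exclusive`) still has to be done by reading ml_model_list.py.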
Full MIME superset (Gemini uses all):
# documents
KilnMimeType.PDF, KilnMimeType.CSV, KilnMimeType.TXT, KilnMimeType.HTML, KilnMimeType.MD
# images
KilnMimeType.JPG, KilnMimeType.PNG
# audio
KilnMimeType.MP3, KilnMimeType.WAV, KilnMimeType.OGG
# video
KilnMimeType.MP4, KilnMimeType.MOV
`suggested_for_evals` / `suggested_for_data_gen` — only set these if the predecessor already has them, OR web search shows the model is a clear SOTA leap (ask user to confirm first).
Zero-sum rule: When adding a new model with these flags, remove them from the oldest same-family model to keep the suggested count stable. Ask the user to confirm the swap before making changes.
`ModelFamily` enum (only if needed) — only add a new family if the vendor is completely new.
Thinking levels (`available_thinking_levels` / `default_thinking_level`) — if the model supports configurable reasoning effort (not just on/off), add `available_thinking_levels` and `default_thinking_level` to each provider entry. See Thinking Levels Reference for the full lookup chain and existing constants.
Quick rules:
- Reuse an existing `_THINKING_LEVELS` constant if the levels match exactly
- Otherwise create a new one named `{MODEL}_{PROVIDER_CONTEXT}_THINKING_LEVELS`
- `default_thinking_level` must be one of the values in `available_thinking_levels`

Tests call real LLMs and cost money. Ideally the user only needs to consent to two script executions: the smoke test, then the full parallel suite.
Vertex AI authentication: Vertex tests require active gcloud credentials. If the model you're changing uses Vertex, ask the user to run gcloud auth application-default login before you run those tests. Failures from missing credentials are auth issues, not model config problems.
-k filter syntax: Always use bracket notation for model+provider filtering, never and:
- ✅ `-k "test_name[glm_5-fireworks_ai]"` or `-k "glm_5"`
- ❌ `-k "glm_5 and fireworks"` — `and` is a pytest keyword expression that can match wrong tests

Before running paid tests, enable parallel testing in pytest.ini:
# Change this line:
# addopts = -n auto
# To:
addopts = -n 8
Important: Revert this change after all tests complete (re-comment the line).
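The toggle and its revert can be done with sed. A hedged sketch on a scratch copy — run the same substitutions against the real pytest.ini, assuming the line appears exactly as shown above:

```shell
# Build a scratch pytest.ini so the demo never touches the real file:
printf '%s\n' '[pytest]' '# addopts = -n auto' > /tmp/pytest_demo.ini
# Enable 8-way parallelism:
sed -i.bak 's/^# addopts = -n auto$/addopts = -n 8/' /tmp/pytest_demo.ini
# ...run the paid suite...
# Revert after all tests complete:
sed -i.bak 's/^addopts = -n 8$/# addopts = -n auto/' /tmp/pytest_demo.ini
grep 'addopts' /tmp/pytest_demo.ini
```

The `-i.bak` suffix form works on both GNU and BSD sed, which matters if the user is on macOS.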
Run a single test+provider combo first:
uv run pytest --runpaid --ollama -k "test_data_gen_sample_all_models_providers[MODEL_ENUM-PROVIDER]"
If it fails, fix the slug/config before proceeding. Use --collect-only to find exact parameter IDs if unsure.
Then run the full suite for the model:

uv run pytest --runpaid --ollama -k "MODEL_ENUM" -v 2>&1 | grep -E "PASSED|FAILED|ERROR|short test|=====|collected"
If tests fail — debug one at a time:
- Re-run the failing test alone with `-v` for full output

Anthropic API key gotcha: if an Anthropic-direct test fails with an auth/API key error, check whether the user's environment exports the key as KILN_ANTHROPIC_API_KEY instead of ANTHROPIC_API_KEY (the Kiln app uses the prefixed name; the Anthropic SDK used by tests expects the unprefixed name). Prepend the test command with a one-shot environment override — don't export it globally:
ANTHROPIC_API_KEY="$KILN_ANTHROPIC_API_KEY" uv run pytest --runpaid ...
Doc extraction tests (only if `supports_doc_extraction=True`) — tests are in libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py.
# See what will run:
uv run pytest --collect-only libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py::test_extract_document_success -q | grep MODEL_ENUM
# Run them:
uv run pytest --runpaid --ollama libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py::test_extract_document_success -k "MODEL_ENUM"
If a provider rejects a data type (400 error), remove that KilnMimeType and re-run.
After all tests complete, revert pytest.ini back to the commented-out state:
# addopts = -n auto
Collect test results for use in the PR body (Phase 5). Organize by model name and provider using these symbols: ✅ passed, ⚠️ passed with caveats (note a brief reason), ❌ failed (note a brief reason).
This skill is often run via Claude Code Web (Slack connector). That environment has a non-user-configurable stop hook which, at end of session, will:
The problems this causes:
- Stray add-model/* branches accumulate on the remote.

The user's desires, in priority order:
- If you must abandon the work: revert your edits (git restore / git clean the specific files you touched) and delete any branch you created (git checkout main && git branch -D add-model/MODEL_NAME) so the stop hook sees a clean tree and exits cleanly. Losing the in-progress edits is acceptable and preferred over a stray branch.

Do NOT commit, push, or create a branch if any of the following are true:
If any of the above apply, stop and ask the user what to do. Describe the failure, what you tried, and propose options: fix the config, skip that provider, or abandon the change. Only proceed to 5a once the user explicitly confirms.
After all tests pass and pytest.ini is reverted, commit the changes and open a PR against main.
- Branch name: `add-model/MODEL_NAME` (e.g. `add-model/glm-5-1`)
- Keep the diff scoped to the model change (ml_model_list.py)

Use gh pr create against main. The PR body must follow this exact format:
## What does this PR do?
Test Results
[Two paragraphs of nuance — describe any unusual findings, things you tried and reverted, known pre-existing failures vs new failures, API quirks discovered, and any config adjustments made during testing.]
[Model Name] ([provider]):
- [N] passed, [N] skipped[, [N] failed]
- [Any notable failures or flakes]
[Repeat for each model+provider combo]
---
[Model Name] ([provider]):
✅ test_data_gen_all_models_providers[model_enum-provider]
✅ test_data_gen_sample_all_models_providers[model_enum-provider]
✅ test_data_gen_sample_all_models_providers_with_structured_output[model_enum-provider]
✅ test_all_built_in_models_llm_as_judge[model_enum-provider]
✅ test_all_built_in_models_structured_output[model_enum-provider]
✅ test_all_built_in_models_structured_input[model_enum-provider]
✅ test_structured_output_cot_prompt_builder[model_enum-provider]
✅ test_all_models_providers_plaintext[model_enum-provider]
✅ test_cot_prompt_builder[model_enum-provider]
⚠️ test_structured_input_cot_prompt_builder[model_enum-provider] — brief reason
❌ test_name[model_enum-provider] — brief reason
[Repeat for each model+provider combo]
## Checklists
- [X] Tests have been run locally and passed
- [X] New tests have been added to any work in /lib
Rules for the PR body:
- Group results under `[Model Name] ([provider]):` headers
- The section after the `---` lists every individual test result

Final checklist:

- [ ] `ModelName` enum entry added (before predecessor)
- [ ] `KilnModel` entry added to `built_in_models` (before predecessor)
- [ ] `friendly_name` matches the naming pattern of sibling models in the same family
- [ ] `ModelFamily` enum updated (only if new family)
- [ ] If relevant, app/web_ui/src/routes/(app)/docs/rag_configs/[project_id]/add_search_tool/rag_config_templates.ts updated
- [ ] Parallel testing enabled in pytest.ini (addopts = -n 8)
- [ ] pytest.ini reverted (re-commented)
- [ ] PR opened against main with test results in the body

## Provider Quirks Reference

Anthropic:
- `temp_top_p_exclusive=True`
- Newer models use `json_schema`; older Opus uses `function_calling`
- `anthropic_extended_thinking=True` + `reasoning_capable=True`

OpenAI:
- `json_schema` for structured output
- `available_thinking_levels` — see Thinking Levels Reference

Gemini:
- `gemini_reasoning_enabled=True` for reasoning-capable models
- `available_thinking_levels` — see Thinking Levels Reference

DeepSeek (R1-style):
- `parser=ModelParserID.r1_thinking` + `reasoning_capable=True`
- On OpenRouter: `r1_openrouter_options=True` + `require_openrouter_reasoning=True`

OpenRouter:
- Slug format: `vendor/model-name`
- `require_openrouter_reasoning=True`
- `openrouter_skip_required_parameters=True`
- `logprobs_openrouter_options=True` if supported
- `multimodal_requires_pdf_as_image=True` (OpenRouter's PDF routing breaks LiteLLM)

Open-weight reasoning models on hosted providers:
- `reasoning_capable=True`, `parser=ModelParserID.r1_thinking`

Qwen:
- `formatter=ModelFormatterID.qwen3_style_no_think`

SiliconFlow:
- `siliconflow_enable_thinking=True/False`

## Thinking Levels Reference

No API provides the available thinking levels programmatically — they must be manually sourced. Use this lookup chain in priority order:
Vendor model page (most authoritative)
- OpenAI: https://developers.openai.com/api/docs/models/{model-id}
- Anthropic: e.g. low, medium, high, max on some models; Sonnet 4.6 supports low, medium, high.
- Some vendors expose only thinking: true/false (boolean only). Levels come from docs.

Vercel AI Gateway docs — clean structured tables per provider:
- https://vercel.com/docs/ai-gateway/capabilities/reasoning/openai
- https://vercel.com/docs/ai-gateway/capabilities/reasoning/anthropic
- https://vercel.com/docs/ai-gateway/capabilities/reasoning/google

Inherit from predecessor — if the same family/tier model has a `_THINKING_LEVELS` dict, the new model very likely uses the same or a superset.
OpenRouter supported_parameters — check if reasoning is present:
curl -s https://openrouter.ai/api/v1/models | jq '.data[] | select(.id == "SLUG") | .supported_parameters'
If reasoning is absent, the model does not support effort levels — skip thinking levels entirely.
Smoke test — as a last resort, send a request with an invalid effort level and check the error message, which often enumerates the valid values.
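The supported_parameters check above can be illustrated offline. A hedged sketch — `or_models.json` mimics the shape of the OpenRouter /api/v1/models response with invented model IDs:

```shell
# A fake two-model response for demonstration:
cat > or_models.json <<'EOF'
{"data": [
  {"id": "vendor/model-a", "supported_parameters": ["temperature", "reasoning"]},
  {"id": "vendor/model-b", "supported_parameters": ["temperature"]}
]}
EOF
# For each model, report whether effort levels are even worth investigating:
jq -r '.data[] | .id + ": " +
  (if (.supported_parameters | index("reasoning")) then "may support effort levels"
   else "skip thinking levels" end)' or_models.json
```

`index("reasoning")` returns the element's position (truthy) or null (falsy), so the `if` cleanly splits the two cases.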
- Only models with configurable effort get `available_thinking_levels` dicts.
- On/off-only thinking models (R1-style) get `reasoning_capable=True` + `parser=ModelParserID.r1_thinking`. Do NOT add thinking level dicts.

Existing constants — reuse when levels match exactly. Create a new constant only if levels differ. This is not an exhaustive list.
| Constant | Levels | Default | Used by |
|---|---|---|---|
| `GPT_5_4_OPENAI_THINKING_LEVELS` | none, low, medium, high, xhigh | none | GPT-5.4 |
| `GPT_5_4_PRO_OPENAI_THINKING_LEVELS` | medium, high, xhigh | medium | GPT-5.4 Pro |
| `GPT_5_2_OPENAI_THINKING_LEVELS` | none, low, medium, high, xhigh | none | GPT-5.2, GPT-5.2 Chat |
| `GPT_5_2_PRO_OPENAI_THINKING_LEVELS` | medium, high, xhigh | medium | GPT-5.2 Pro |
| `GPT_5_1_OPENAI_THINKING_LEVELS` | none, low, medium, high | none | GPT-5.1 |
| `GPT_5_OPENAI_THINKING_LEVELS` | minimal, low, medium, high | medium | GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-5 Chat |
| `GEMINI_3_PRO_THINKING_LEVELS` | low, medium, high | high | Gemini 3 Pro, Gemini 3.1 Pro |
| `GEMINI_3_FLASH_THINKING_LEVELS` | minimal, low, medium, high | high | Gemini 3 Flash, Gemini 3.1 Flash Lite |
| `CLAUDE_ANTHROPIC_EFFORT_THINKING_LEVELS` | low, medium, high | high | Claude (Anthropic direct) |
| `CLAUDE_OPENROUTER_THINKING_LEVELS` | none, minimal, low, medium, high, xhigh | none | Claude (OpenRouter) |
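Before committing a new constant, the "default must be one of the available values" rule is worth a quick self-check. A hedged sketch using jq on an illustrative JSON encoding of such a dict (the level values are examples, not a real constant):

```shell
# Encode the proposed constant as JSON for checking:
cat > levels.json <<'EOF'
{"available": ["low", "medium", "high"], "default": "high"}
EOF
# index($d) is null only when the default is absent from the available list:
jq -e '.default as $d | .available | index($d) != null' levels.json >/dev/null \
  && echo "default is valid" || echo "default NOT in available levels"
```

In practice this check can also just be done by eye; the sketch only makes the invariant explicit.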
These were investigated and confirmed to lack thinking level data:
- OpenRouter — only `reasoning` in `supported_parameters`
- LiteLLM catalog — only `supports_reasoning: true/false`
- models.dev — only `thinking: true/false`
- Provider `/v1/models` endpoints — minimal objects with no capability fields

## Slug Lookup Reference

Use both LiteLLM and models.dev when looking up slugs — they complement each other. LiteLLM gives you the exact slugs Kiln will use (since Kiln runs on LiteLLM), while models.dev often has broader coverage of newer or niche models with pricing, context limits, and capability details.
LiteLLM catalog (api.litellm.ai) — 100 free requests/day, no key needed. Supports server-side filtering: model= (substring match), provider=, mode=, supports_vision=true, supports_reasoning=true, page_size=500.
# Find all variants of a model across providers:
curl -s 'https://api.litellm.ai/model_catalog?model=MODEL_NAME&mode=chat&page_size=500' \
-H 'accept: application/json' | jq '.data[] | {id, provider, mode, max_input_tokens, supports_vision, supports_reasoning, supports_function_calling}'
# List all models for a provider:
curl -s 'https://api.litellm.ai/model_catalog?provider=PROVIDER&mode=chat&page_size=500' \
-H 'accept: application/json' | jq '.data[].id'
models.dev — mega JSON covering 50+ providers with model IDs, pricing, context limits, capabilities, and release dates. Large file — always use curl+jq, never WebFetch.
# Search all model IDs across all providers:
curl -s https://models.dev/api.json | jq '[to_entries[].value.models // {} | keys[]] | .[]' | grep -i "SEARCH_TERM"
# List all model IDs for a specific provider:
curl -s https://models.dev/api.json | jq '.["PROVIDER"].models | keys[]'
# Get full details for a specific provider+model:
curl -s https://models.dev/api.json | jq '.["PROVIDER"].models["MODEL_ID"]'
OpenRouter:

curl -s https://openrouter.ai/api/v1/models | jq '.data[].id' | grep -i "SEARCH_TERM"

## Lagging Providers Reference

Fireworks, Together, and SiliconFlow typically expose new models on their own endpoints 1–2 weeks before models.dev / LiteLLM catch up. For these providers, always cross-check directly — both when adding a new model and when running the Phase 1B backfill check.
Fireworks AI — model pages are the most current source. WebFetch directly:
WebFetch https://fireworks.ai/models/fireworks/{model-slug}
Or browse the catalog at https://fireworks.ai/models. Kiln slug format: accounts/fireworks/models/{model-slug}.
Together AI — the /v1/models endpoint requires an API key. $TOGETHER_API_KEY is typically set in the user's shell:
# List all Together model IDs matching a term:
curl -s https://api.together.xyz/v1/models \
-H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | .id' | grep -i "SEARCH_TERM"
# Full record for a specific slug:
curl -s https://api.together.xyz/v1/models \
-H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | select(.id == "SLUG")'
If the key isn't set, prompt the user to export it — don't silently fall back to models.dev.
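That guard can be made explicit before any Together query. A minimal sketch (the function name is illustrative, not part of this repo):

```shell
# Check for the key up front instead of letting curl fail with a 401:
check_together_key() {
  if [ -z "${TOGETHER_API_KEY:-}" ]; then
    echo "TOGETHER_API_KEY missing — ask the user to export it"
    return 1
  fi
  echo "key present"
}
```

Usage: `check_together_key || exit 1` before the curl commands above; the non-zero return makes it easy to short-circuit a script.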
SiliconFlow — WebFetch the public model catalog page, or a specific model page if you have the vendor/model path:
WebFetch https://siliconflow.com/models
WebFetch https://siliconflow.com/models/{vendor}/{model}
When you find a new reliable slug source, append it here.