Add new AI models to Kiln's ml_model_list.py and produce a Discord announcement. Use when the user wants to add, integrate, or register a new LLM model (e.g. Claude, GPT, DeepSeek, Gemini, Kimi, Qwen, Grok) into the Kiln model list, mentions adding a model to ml_model_list.py, or asks to discover/find new models that are available but not yet in Kiln.
Integrating a new model into libs/core/kiln_ai/adapters/ml_model_list.py requires:
- `ModelName` enum – add an enum member
- `built_in_models` list – add a `KilnModel(...)` entry with providers
- `ModelFamily` enum – only if the vendor is brand-new

After code changes, run paid integration tests, then draft a Discord post.
These apply throughout the entire workflow.
model_id must come from an authoritative source (LiteLLM catalog, official docs, API reference, or changelog). If you can't verify a slug, tell the user and ask them to provide it.

If the user asks you to find new models, do NOT just web search "new AI models this week" — that only surfaces major releases. Instead, systematically check each family against both the LiteLLM catalog and models.dev, then union the results. Both are attempts to catalog available models and each has gaps the other fills.
Read the ModelFamily and ModelName enums to know what we already have.
Query both catalogs for each family (run in parallel where possible):
LiteLLM catalog — filters out mirror providers to avoid duplicates:
curl -s 'https://api.litellm.ai/model_catalog?model=SEARCH_TERM&mode=chat&page_size=500' -H 'accept: application/json' | jq '[.data[] | select(.provider != "openrouter" and .provider != "bedrock" and .provider != "bedrock_converse" and .provider != "vertex_ai-anthropic_models" and .provider != "azure") | .id] | unique | .[]'
models.dev — search all model IDs across all providers:
curl -s https://models.dev/api.json | jq '[to_entries[].value.models // {} | keys[]] | .[]' | grep -i "SEARCH_TERM"
For details on a specific provider+model: curl -s https://models.dev/api.json | jq '.["PROVIDER"].models["MODEL_ID"]'
Search terms (one query per term):
claude, gpt, o1, o3, o4 (OpenAI reasoning), gemini, llama, deepseek, qwen, qwq, mistral, grok, kimi, glm, minimax, hunyuan, ernie, phi, gemma, seed, step, pangu
Union and cross-reference results from both catalogs against ModelName. A model found in either source counts as available. Focus on direct-provider entries (not OpenRouter/Bedrock/Azure mirrors). Skip pure coding models (e.g. codestral, deepseek-coder, qwen-coder).
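The union-and-diff step above can be sketched mechanically. A hedged illustration — the file names and model IDs below are invented stand-ins for the catalog query results and the existing `ModelName` values:

```shell
# IDs pulled from the LiteLLM catalog (illustrative):
cat > litellm_ids.txt <<'EOF'
claude-opus-4-5
kimi-k2
EOF
# IDs pulled from models.dev (illustrative):
cat > modelsdev_ids.txt <<'EOF'
glm-5
kimi-k2
EOF
# IDs already present in Kiln's ModelName enum (illustrative):
cat > existing_kiln_ids.txt <<'EOF'
claude-opus-4-5
EOF
# Union both catalogs, then drop anything Kiln already has:
sort -u litellm_ids.txt modelsdev_ids.txt > union.txt
sort existing_kiln_ids.txt > existing_sorted.txt
comm -23 union.txt existing_sorted.txt
```

`comm -23` prints lines unique to the first (sorted) input, which here is exactly the "available but not yet in Kiln" candidate list to present to the user.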
Run targeted web searches per family to catch very fresh releases not yet in either catalog:
- "[family] new model [current year]"
- "[family] release [current month] [current year]"

Present findings as a summary. Let the user decide which to add.
Some providers — Fireworks AI, Together AI, SiliconFlow — expose new models on their own endpoints 1–2 weeks before those entries surface in models.dev / LiteLLM. Relying only on those two catalogs will both under-populate the provider list for the model you're adding now and miss the window to backfill recently-added models whose provider support has since grown.
Run this check on every invocation of the skill, regardless of whether you're in discovery mode or adding a specific model.
Pull the 10 most recently added models from the top of built_in_models in ml_model_list.py (newest are at the top), or from git:
git log --follow -p -- libs/core/kiln_ai/adapters/ml_model_list.py | grep -E "^\+\s+name=ModelName\." | head -20
For the model you're adding (if any) AND each of those 10 models, cross-check Fireworks, Together, and SiliconFlow directly using the endpoints in the Lagging Providers Reference. Do NOT trust models.dev / LiteLLM as the final word for these three providers.
If a lagging provider now supports a recently-added model that isn't yet in its KilnModel entry, flag it to the user and propose either bundling the provider addition into the current change or opening a separate PR. Do not silently add it.
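The backfill comparison reduces to a set difference. A hedged offline sketch — `together_catalog.txt` stands in for IDs fetched from the provider's endpoint and `kiln_ids.txt` for the `model_id` values Kiln already lists for that provider (contents invented):

```shell
# Slugs the provider currently serves (illustrative):
cat > together_catalog.txt <<'EOF'
moonshotai/kimi-k2
zai-org/glm-5
EOF
# Slugs Kiln already has for this provider (illustrative):
cat > kiln_ids.txt <<'EOF'
moonshotai/kimi-k2
EOF
# Catalog entries missing from Kiln = backfill candidates to flag to the user:
grep -Fxv -f kiln_ids.txt together_catalog.txt
```

`grep -Fxv -f` treats `kiln_ids.txt` as fixed whole-line patterns and prints only catalog lines that match none of them.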
Read the predecessor model in ml_model_list.py (e.g. for Opus 4.6 → read Opus 4.5). You inherit most parameters from it.
Query the LiteLLM catalog for the new model. This is the primary slug source since Kiln uses LiteLLM. See the Slug Lookup Reference for query syntax and all verified sources.
Get the OpenRouter slug via:

- `curl -s https://openrouter.ai/api/v1/models | jq '.data[].id' | grep -i "SEARCH_TERM"`
- Web search: `openrouter [model name] model id`

Get the direct-provider slug (Anthropic, OpenAI, Google, etc.). Use the LiteLLM catalog first, then official docs. See the Slug Lookup Reference for provider-specific URLs.
Identify quirks — check the Provider Quirks Reference for the relevant provider, and web search for any new quirks:
- Reasoning behavior (`reasoning_capable`, parsers, OpenRouter options)?
- Sampling restrictions (`temp_top_p_exclusive`, etc.)?
- Rate limits (`max_parallel_requests`)?

Determine thinking levels — does the model support configurable reasoning effort? See Thinking Levels Reference for the full lookup chain. Key quick checks:
- OpenRouter `supported_parameters` — if `reasoning` is absent, skip thinking levels

All changes go in libs/core/kiln_ai/adapters/ml_model_list.py.
- `ModelName` enum — e.g. `claude_opus_4_6 = "claude_opus_4_6"`
- `KilnModel` entry in `built_in_models` — `name`, `friendly_name`, `model_id` per provider, flags

`friendly_name` must follow the existing naming pattern of sibling models in the same family. Check the predecessor. For example, Claude Sonnets use "Claude {version} Sonnet" (e.g. "Claude 4.5 Sonnet"), not "Claude Sonnet {version}". Do NOT use the vendor's marketing name if it differs from Kiln's established convention.

Provider `model_id` formats:
| Provider | Format | Notes |
|---|---|---|
| `openrouter` | `vendor/model-name` | Always verify via API |
| `openai` | Bare model name | Verify via OpenAI docs |
| `anthropic` | Variable — older models have date stamps, newer may not | Always verify via Anthropic docs |
| `gemini_api` | Bare name | Verify via Google AI Studio docs |
| `fireworks_ai` | `accounts/fireworks/models/...` | Verify via Fireworks docs |
| `together_ai` | Vendor path format | Verify via Together docs |
| `vertex` | Usually same as `gemini_api` | Verify via Vertex docs |
| `siliconflow_cn` | Vendor/model format | Verify via SiliconFlow docs |
Every single model_id must be verified from an authoritative source. No exceptions.
Setting flags — use catalog data + predecessor as dual signals:
The LiteLLM catalog and models.dev responses include capability flags (supports_vision, supports_function_calling, supports_reasoning, etc.). Use these as the primary signal for what to enable on the new model:
- `supports_vision: true` → enable `supports_vision`, `multimodal_capable`, and vision MIME types (see 2c)
- `supports_function_calling: true` → use `StructuredOutputMode.json_schema` (or `function_calling` depending on provider norms — check predecessor)
- `supports_reasoning: true` → enable `reasoning_capable` and check if parser/formatter/thinking flags are needed

Then cross-check against the predecessor. The predecessor tells you how Kiln configures a similar model (which structured_output_mode, which provider-specific flags, etc.). The catalog tells you what the model can do. Use both:
- Predecessor has `temp_top_p_exclusive` but nothing in the catalog mentions it? Keep it — it's a provider quirk the catalog doesn't track.

Common flags:
- `structured_output_mode` – how the model handles JSON output
- `suggested_for_evals` / `suggested_for_data_gen` – see zero-sum rule below
- `multimodal_capable` / `supports_vision` / `supports_doc_extraction` – see multimodal rules below
- `reasoning_capable` – for thinking/reasoning models
- `temp_top_p_exclusive` – Anthropic models that can't have both temp and top_p
- `parser` / `formatter` – for models needing special parsing (e.g. R1-style thinking)

If the model supports non-text inputs, configure:
- `multimodal_capable=True` and `supports_doc_extraction=True` if it supports any MIME types
- `supports_vision=True` if it supports images
- `multimodal_requires_pdf_as_image=True` if vision-capable but no native PDF support (also add `KilnMimeType.PDF` to MIME list). Always set this on OpenRouter providers — OpenRouter routes PDFs through Mistral OCR which breaks LiteLLM parsing.
- Always include `KilnMimeType.TXT` and `KilnMimeType.MD` on any `multimodal_capable` model

Strategy: start broad, narrow based on test failures. Enable a generous set of MIME types, run tests, and remove only types the provider explicitly rejects (400 errors). Don't remove types for timeout/auth/content-mismatch failures.
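The capability-to-flag mapping can be sketched mechanically. A hedged illustration — `entry.json` mimics the shape of a LiteLLM catalog record (field names from the catalog queries in this doc; the values and model ID are invented):

```shell
# A fake catalog record for demonstration:
cat > entry.json <<'EOF'
{"id": "example-model",
 "supports_vision": true,
 "supports_function_calling": true,
 "supports_reasoning": false}
EOF
# Print the Kiln flags each capability suggests enabling:
jq -r '
  (if .supports_vision then "supports_vision=True multimodal_capable=True" else empty end),
  (if .supports_function_calling then "structured_output_mode=json_schema (check predecessor)" else empty end),
  (if .supports_reasoning then "reasoning_capable=True" else empty end)
' entry.json
```

This only encodes the catalog side of the dual-signal rule — the predecessor cross-check (provider quirks like `temp_top_p_exclusive`) still has to be done by reading ml_model_list.py.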
Full MIME superset (Gemini uses all):
# documents
KilnMimeType.PDF, KilnMimeType.CSV, KilnMimeType.TXT, KilnMimeType.HTML, KilnMimeType.MD
# images
KilnMimeType.JPG, KilnMimeType.PNG
# audio
KilnMimeType.MP3, KilnMimeType.WAV, KilnMimeType.OGG
# video
KilnMimeType.MP4, KilnMimeType.MOV
`suggested_for_evals` / `suggested_for_data_gen` — only set these if the predecessor already has them, OR web search shows the model is a clear SOTA leap (ask user to confirm first).
Zero-sum rule: When adding a new model with these flags, remove them from the oldest same-family model to keep the suggested count stable. Ask the user to confirm the swap before making changes.
`ModelFamily` enum (only if needed) — only add a new family if the vendor is completely new.
Thinking levels (`available_thinking_levels` / `default_thinking_level`) — if the model supports configurable reasoning effort (not just on/off), add `available_thinking_levels` and `default_thinking_level` to each provider entry. See Thinking Levels Reference for the full lookup chain and existing constants.
Quick rules:
- Reuse an existing `_THINKING_LEVELS` constant if the levels match exactly
- Otherwise create a new one named `{MODEL}_{PROVIDER_CONTEXT}_THINKING_LEVELS`
- `default_thinking_level` must be one of the values in `available_thinking_levels`

Tests call real LLMs and cost money. Ideally the user only needs to consent to two script executions: the smoke test, then the full parallel suite.
Vertex AI authentication: Vertex tests require active gcloud credentials. If the model you're changing uses Vertex, ask the user to run gcloud auth application-default login before you run those tests. Failures from missing credentials are auth issues, not model config problems.
-k filter syntax: Always use bracket notation for model+provider filtering, never and:
- ✅ `-k "test_name[glm_5-fireworks_ai]"` or `-k "glm_5"`
- ❌ `-k "glm_5 and fireworks"` — `and` is a pytest keyword expression that can match wrong tests

Before running paid tests, enable parallel testing in pytest.ini:
# Change this line:
# addopts = -n auto
# To:
addopts = -n 8
Important: Revert this change after all tests complete (re-comment the line).
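The toggle and its revert can be done with sed. A hedged sketch on a scratch copy — run the same substitutions against the real pytest.ini, assuming the line appears exactly as shown above:

```shell
# Build a scratch pytest.ini so the demo never touches the real file:
printf '%s\n' '[pytest]' '# addopts = -n auto' > /tmp/pytest_demo.ini
# Enable 8-way parallelism:
sed -i.bak 's/^# addopts = -n auto$/addopts = -n 8/' /tmp/pytest_demo.ini
# ...run the paid suite...
# Revert after all tests complete:
sed -i.bak 's/^addopts = -n 8$/# addopts = -n auto/' /tmp/pytest_demo.ini
grep 'addopts' /tmp/pytest_demo.ini
```

The `-i.bak` suffix form works on both GNU and BSD sed, which matters if the user is on macOS.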
Run a single test+provider combo first:
uv run pytest --runpaid --ollama -k "test_data_gen_sample_all_models_providers[MODEL_ENUM-PROVIDER]"
If it fails, fix the slug/config before proceeding. Use --collect-only to find exact parameter IDs if unsure.
Then run the full suite for the model:

uv run pytest --runpaid --ollama -k "MODEL_ENUM" -v 2>&1 | grep -E "PASSED|FAILED|ERROR|short test|=====|collected"
If tests fail — debug one at a time:
- Re-run the failing test alone with `-v` for full output

Anthropic API key gotcha: if an Anthropic-direct test fails with an auth/API key error, check whether the user's environment exports the key as KILN_ANTHROPIC_API_KEY instead of ANTHROPIC_API_KEY (the Kiln app uses the prefixed name; the Anthropic SDK used by tests expects the unprefixed name). Prepend the test command with a one-shot environment override — don't export it globally:
ANTHROPIC_API_KEY="$KILN_ANTHROPIC_API_KEY" uv run pytest --runpaid ...
Doc extraction tests (only if `supports_doc_extraction=True`) — tests are in libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py.
# See what will run:
uv run pytest --collect-only libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py::test_extract_document_success -q | grep MODEL_ENUM
# Run them:
uv run pytest --runpaid --ollama libs/core/kiln_ai/adapters/extractors/test_litellm_extractor.py::test_extract_document_success -k "MODEL_ENUM"
If a provider rejects a data type (400 error), remove that KilnMimeType and re-run.
After all tests complete, revert pytest.ini back to the commented-out state:
# addopts = -n auto
Collect test results for use in the PR body (Phase 5). Organize by model name and provider using these symbols: ✅ passed, ⚠️ passed with caveats (note a brief reason), ❌ failed (note a brief reason).
This skill is often run via Claude Code Web (Slack connector). That environment has a non-user-configurable stop hook which, at end of session, will:
The problems this causes:
- Stray add-model/* branches accumulate on the remote.

The user's desires, in priority order:
- If you must abandon the work: revert your edits (git restore / git clean the specific files you touched) and delete any branch you created (git checkout main && git branch -D add-model/MODEL_NAME) so the stop hook sees a clean tree and exits cleanly. Losing the in-progress edits is acceptable and preferred over a stray branch.

Do NOT commit, push, or create a branch if any of the following are true:
If any of the above apply, stop and ask the user what to do. Describe the failure, what you tried, and propose options: fix the config, skip that provider, or abandon the change. Only proceed to 5a once the user explicitly confirms.
After all tests pass and pytest.ini is reverted, commit the changes and open a PR against main.
- Branch name: `add-model/MODEL_NAME` (e.g. `add-model/glm-5-1`)
- Keep the diff scoped to the model change (ml_model_list.py)

Use gh pr create against main. The PR body must follow this exact format:
## What does this PR do?
Test Results
[Two paragraphs of nuance — describe any unusual findings, things you tried and reverted, known pre-existing failures vs new failures, API quirks discovered, and any config adjustments made during testing.]
[Model Name] ([provider]):
- [N] passed, [N] skipped[, [N] failed]
- [Any notable failures or flakes]
[Repeat for each model+provider combo]
---
[Model Name] ([provider]):
✅ test_data_gen_all_models_providers[model_enum-provider]
✅ test_data_gen_sample_all_models_providers[model_enum-provider]
✅ test_data_gen_sample_all_models_providers_with_structured_output[model_enum-provider]
✅ test_all_built_in_models_llm_as_judge[model_enum-provider]
✅ test_all_built_in_models_structured_output[model_enum-provider]
✅ test_all_built_in_models_structured_input[model_enum-provider]
✅ test_structured_output_cot_prompt_builder[model_enum-provider]
✅ test_all_models_providers_plaintext[model_enum-provider]
✅ test_cot_prompt_builder[model_enum-provider]
⚠️ test_structured_input_cot_prompt_builder[model_enum-provider] — brief reason
❌ test_name[model_enum-provider] — brief reason
[Repeat for each model+provider combo]
## Checklists
- [X] Tests have been run locally and passed
- [X] New tests have been added to any work in /lib
Rules for the PR body:
- Group results under `[Model Name] ([provider]):` headers
- The section after the `---` lists every individual test result

Final checklist:

- [ ] `ModelName` enum entry added (before predecessor)
- [ ] `KilnModel` entry added to `built_in_models` (before predecessor)
- [ ] `friendly_name` matches the naming pattern of sibling models in the same family
- [ ] `ModelFamily` enum updated (only if new family)
- [ ] If relevant, app/web_ui/src/routes/(app)/docs/rag_configs/[project_id]/add_search_tool/rag_config_templates.ts updated
- [ ] Parallel testing enabled in pytest.ini (addopts = -n 8)
- [ ] pytest.ini reverted (re-commented)
- [ ] PR opened against main with test results in the body

## Provider Quirks Reference

Anthropic:
- `temp_top_p_exclusive=True`
- Newer models use `json_schema`; older Opus uses `function_calling`
- `anthropic_extended_thinking=True` + `reasoning_capable=True`

OpenAI:
- `json_schema` for structured output
- `available_thinking_levels` — see Thinking Levels Reference

Gemini:
- `gemini_reasoning_enabled=True` for reasoning-capable models
- `available_thinking_levels` — see Thinking Levels Reference

DeepSeek (R1-style):
- `parser=ModelParserID.r1_thinking` + `reasoning_capable=True`
- On OpenRouter: `r1_openrouter_options=True` + `require_openrouter_reasoning=True`

OpenRouter:
- Slug format: `vendor/model-name`
- `require_openrouter_reasoning=True`
- `openrouter_skip_required_parameters=True`
- `logprobs_openrouter_options=True` if supported
- `multimodal_requires_pdf_as_image=True` (OpenRouter's PDF routing breaks LiteLLM)

Open-weight reasoning models on hosted providers:
- `reasoning_capable=True`, `parser=ModelParserID.r1_thinking`

Qwen:
- `formatter=ModelFormatterID.qwen3_style_no_think`

SiliconFlow:
- `siliconflow_enable_thinking=True/False`

## Thinking Levels Reference

No API provides the available thinking levels programmatically — they must be manually sourced. Use this lookup chain in priority order:
Vendor model page (most authoritative)
- OpenAI: https://developers.openai.com/api/docs/models/{model-id}
- Anthropic: e.g. low, medium, high, max on some models; Sonnet 4.6 supports low, medium, high.
- Some vendors expose only thinking: true/false (boolean only). Levels come from docs.

Vercel AI Gateway docs — clean structured tables per provider:
- https://vercel.com/docs/ai-gateway/capabilities/reasoning/openai
- https://vercel.com/docs/ai-gateway/capabilities/reasoning/anthropic
- https://vercel.com/docs/ai-gateway/capabilities/reasoning/google

Inherit from predecessor — if the same family/tier model has a `_THINKING_LEVELS` dict, the new model very likely uses the same or a superset.
OpenRouter supported_parameters — check if reasoning is present:
curl -s https://openrouter.ai/api/v1/models | jq '.data[] | select(.id == "SLUG") | .supported_parameters'
If reasoning is absent, the model does not support effort levels — skip thinking levels entirely.
Smoke test — as a last resort, send a request with an invalid effort level and check the error message, which often enumerates the valid values.
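The supported_parameters check above can be illustrated offline. A hedged sketch — `or_models.json` mimics the shape of the OpenRouter /api/v1/models response with invented model IDs:

```shell
# A fake two-model response for demonstration:
cat > or_models.json <<'EOF'
{"data": [
  {"id": "vendor/model-a", "supported_parameters": ["temperature", "reasoning"]},
  {"id": "vendor/model-b", "supported_parameters": ["temperature"]}
]}
EOF
# For each model, report whether effort levels are even worth investigating:
jq -r '.data[] | .id + ": " +
  (if (.supported_parameters | index("reasoning")) then "may support effort levels"
   else "skip thinking levels" end)' or_models.json
```

`index("reasoning")` returns the element's position (truthy) or null (falsy), so the `if` cleanly splits the two cases.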
- Only models with configurable effort get `available_thinking_levels` dicts.
- On/off-only thinking models (R1-style) get `reasoning_capable=True` + `parser=ModelParserID.r1_thinking`. Do NOT add thinking level dicts.

Existing constants — reuse when levels match exactly. Create a new constant only if levels differ. This is not an exhaustive list.
| Constant | Levels | Default | Used by |
|---|---|---|---|
| `GPT_5_4_OPENAI_THINKING_LEVELS` | none, low, medium, high, xhigh | none | GPT-5.4 |
| `GPT_5_4_PRO_OPENAI_THINKING_LEVELS` | medium, high, xhigh | medium | GPT-5.4 Pro |
| `GPT_5_2_OPENAI_THINKING_LEVELS` | none, low, medium, high, xhigh | none | GPT-5.2, GPT-5.2 Chat |
| `GPT_5_2_PRO_OPENAI_THINKING_LEVELS` | medium, high, xhigh | medium | GPT-5.2 Pro |
| `GPT_5_1_OPENAI_THINKING_LEVELS` | none, low, medium, high | none | GPT-5.1 |
| `GPT_5_OPENAI_THINKING_LEVELS` | minimal, low, medium, high | medium | GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-5 Chat |
| `GEMINI_3_PRO_THINKING_LEVELS` | low, medium, high | high | Gemini 3 Pro, Gemini 3.1 Pro |
| `GEMINI_3_FLASH_THINKING_LEVELS` | minimal, low, medium, high | high | Gemini 3 Flash, Gemini 3.1 Flash Lite |
| `CLAUDE_ANTHROPIC_EFFORT_THINKING_LEVELS` | low, medium, high | high | Claude (Anthropic direct) |
| `CLAUDE_OPENROUTER_THINKING_LEVELS` | none, minimal, low, medium, high, xhigh | none | Claude (OpenRouter) |
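Before committing a new constant, the "default must be one of the available values" rule is worth a quick self-check. A hedged sketch using jq on an illustrative JSON encoding of such a dict (the level values are examples, not a real constant):

```shell
# Encode the proposed constant as JSON for checking:
cat > levels.json <<'EOF'
{"available": ["low", "medium", "high"], "default": "high"}
EOF
# index($d) is null only when the default is absent from the available list:
jq -e '.default as $d | .available | index($d) != null' levels.json >/dev/null \
  && echo "default is valid" || echo "default NOT in available levels"
```

In practice this check can also just be done by eye; the sketch only makes the invariant explicit.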
These were investigated and confirmed to lack thinking level data:
- OpenRouter — only `reasoning` in `supported_parameters`
- LiteLLM catalog — only `supports_reasoning: true/false`
- models.dev — only `thinking: true/false`
- Provider `/v1/models` endpoints — minimal objects with no capability fields

## Slug Lookup Reference

Use both LiteLLM and models.dev when looking up slugs — they complement each other. LiteLLM gives you the exact slugs Kiln will use (since Kiln runs on LiteLLM), while models.dev often has broader coverage of newer or niche models with pricing, context limits, and capability details.
LiteLLM catalog (api.litellm.ai) — 100 free requests/day, no key needed. Supports server-side filtering: model= (substring match), provider=, mode=, supports_vision=true, supports_reasoning=true, page_size=500.
# Find all variants of a model across providers:
curl -s 'https://api.litellm.ai/model_catalog?model=MODEL_NAME&mode=chat&page_size=500' \
-H 'accept: application/json' | jq '.data[] | {id, provider, mode, max_input_tokens, supports_vision, supports_reasoning, supports_function_calling}'
# List all models for a provider:
curl -s 'https://api.litellm.ai/model_catalog?provider=PROVIDER&mode=chat&page_size=500' \
-H 'accept: application/json' | jq '.data[].id'
models.dev — mega JSON covering 50+ providers with model IDs, pricing, context limits, capabilities, and release dates. Large file — always use curl+jq, never WebFetch.
# Search all model IDs across all providers:
curl -s https://models.dev/api.json | jq '[to_entries[].value.models // {} | keys[]] | .[]' | grep -i "SEARCH_TERM"
# List all model IDs for a specific provider:
curl -s https://models.dev/api.json | jq '.["PROVIDER"].models | keys[]'
# Get full details for a specific provider+model:
curl -s https://models.dev/api.json | jq '.["PROVIDER"].models["MODEL_ID"]'
OpenRouter:

curl -s https://openrouter.ai/api/v1/models | jq '.data[].id' | grep -i "SEARCH_TERM"

## Lagging Providers Reference

Fireworks, Together, and SiliconFlow typically expose new models on their own endpoints 1–2 weeks before models.dev / LiteLLM catch up. For these providers, always cross-check directly — both when adding a new model and when running the Phase 1B backfill check.
Fireworks AI — model pages are the most current source. WebFetch directly:
WebFetch https://fireworks.ai/models/fireworks/{model-slug}
Or browse the catalog at https://fireworks.ai/models. Kiln slug format: accounts/fireworks/models/{model-slug}.
Together AI — the /v1/models endpoint requires an API key. $TOGETHER_API_KEY is typically set in the user's shell:
# List all Together model IDs matching a term:
curl -s https://api.together.xyz/v1/models \
-H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | .id' | grep -i "SEARCH_TERM"
# Full record for a specific slug:
curl -s https://api.together.xyz/v1/models \
-H "Authorization: Bearer $TOGETHER_API_KEY" | jq '.[] | select(.id == "SLUG")'
If the key isn't set, prompt the user to export it — don't silently fall back to models.dev.
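That guard can be made explicit before any Together query. A minimal sketch (the function name is illustrative, not part of this repo):

```shell
# Check for the key up front instead of letting curl fail with a 401:
check_together_key() {
  if [ -z "${TOGETHER_API_KEY:-}" ]; then
    echo "TOGETHER_API_KEY missing — ask the user to export it"
    return 1
  fi
  echo "key present"
}
```

Usage: `check_together_key || exit 1` before the curl commands above; the non-zero return makes it easy to short-circuit a script.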
SiliconFlow — WebFetch the public model catalog page, or a specific model page if you have the vendor/model path:
WebFetch https://siliconflow.com/models
WebFetch https://siliconflow.com/models/{vendor}/{model}
When you find a new reliable slug source, append it here.