Diagnose training issues with Tinker — slow steps, hanging sessions, output mismatches, error messages, renderer problems, and deployment issues. Use this skill whenever a user reports that training is slow, steps take too long, sessions are hanging, model outputs differ between Tinker and external engines (vLLM, SGLang), they get a confusing error message, training quality is poor (high KL, bad outputs), or they suspect something is wrong. Also trigger when users ask "is this a Tinker issue or my issue?", "is Tinker down?", report unexpected wait times, see output quality regressions, get opaque errors, or want to profile/debug their training or deployment pipeline. This skill walks through systematic triage to determine root cause.
Systematic triage for training and deployment issues, organized into five triage paths. Identify which category the user's problem falls into, then follow the appropriate triage path.
Understanding the SDK's threading model is key to diagnosing most issues. The SDK runs a background thread with its own asyncio event loop. All network I/O, heartbeats, and API result polling happen on this thread.
┌─────────────────────┐ ┌──────────────────────────────┐
│ Main Thread │ │ SDK Background Thread │
│ (user code) │ │ (asyncio event loop) │
│ │ │ │
│ fb = tc.fwd_bwd_ │────>│ HTTP POST /forward_backward │
│ async(data) │ │ → returns request_id │
│ │ │ │
│ # prepare next │ │ Long-poll /retrieve_future │
│ # batch here... │ │ (HTTP 408 = not ready yet) │
│ │ │ │
│ result = fb.result()│<────│ Result arrives → resolve │
│ # blocks until done │ │ │
│ │ │ Heartbeat every 10s │
└─────────────────────┘ └──────────────────────────────┘
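The pattern in the diagram can be sketched in a few lines of standard-library Python. This is an assumed reconstruction of the general technique, not the actual Tinker SDK source: a daemon thread owns an asyncio event loop, and the main thread submits coroutines to it with `run_coroutine_threadsafe`. The `fake_api_call` coroutine is a stand-in for the SDK's network I/O.

```python
# Sketch (assumed, not actual SDK code): a background daemon thread
# runs its own asyncio loop; the main thread submits work to it.
import asyncio
import threading

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def fake_api_call(x):
    await asyncio.sleep(0.01)  # stands in for an HTTP round-trip
    return x + 1

# run_coroutine_threadsafe returns a concurrent.futures.Future that
# the main thread can block on with .result()
fut = asyncio.run_coroutine_threadsafe(fake_api_call(41), loop)
print(fut.result())  # → 42
```

The key property this sketch shares with the SDK is that the main thread never runs the event loop itself; it only submits work and blocks on futures, which is why GIL contention on the main thread can starve the loop.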
When you call forward_backward_async(), the SDK:
1. Sends an HTTP POST to /forward_backward, which returns a request_id immediately
2. Long-polls /retrieve_future on the background thread (HTTP 408 means "not ready yet")
3. Resolves the returned future once the result arrives

Calling .result() on the future blocks the main thread until the background thread resolves it.

Why this matters for debugging: The background thread shares the Python GIL with the main thread. If user code holds the GIL for extended periods (heavy numpy/torch computation, CPU-bound data processing, slow serialization), the background thread cannot poll for results, send heartbeats, or perform any other network I/O.

This means "my training is slow/hanging" is often caused by the user's own code blocking the SDK's background thread via GIL contention, not by a network or server issue.
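The "prepare next batch while the request is in flight" overlap from the diagram can be demonstrated with a plain `concurrent.futures.Future`. The `fake_forward_backward` function is hypothetical, a stand-in for the SDK call that resolves a future from a background thread; the point is the ordering, not the API.

```python
# Sketch of the submit-then-overlap pattern: do useful main-thread
# work between submitting a request and blocking on its result.
import threading
import time
from concurrent.futures import Future

def fake_forward_backward(batch):
    # Hypothetical stand-in for the SDK: resolves a Future on a
    # background thread after a simulated network round-trip.
    fut = Future()
    def worker():
        time.sleep(0.05)  # simulated server-side work
        fut.set_result(sum(batch))
    threading.Thread(target=worker, daemon=True).start()
    return fut

fut = fake_forward_backward([1, 2, 3])
next_batch = [x * 2 for x in [1, 2, 3]]  # prepare next batch while in flight
result = fut.result()  # blocks until the background thread resolves it
print(result)  # → 6
```

The caveat from the triage above applies here: if the "prepare next batch" step is CPU-bound pure Python, it holds the GIL and can delay the background thread's polling, so keep that work light or move it into subprocesses.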
Work through these steps in order. Most issues are caught in steps 1-3 and never need deep profiling.
Bad dependency versions are a silent killer. Check these first because they're fast to verify and cause mysterious slowdowns that look like service issues.
import sys, pydantic, tinker
print(f"Python: {sys.version}")
print(f"pydantic: {pydantic.__version__}")
print(f"tinker SDK: {tinker.__version__}")
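Beyond printing versions, it can help to fail fast with explicit assertions. The minimum versions below are placeholders for illustration, not official Tinker requirements, and the crude `parse` helper is a hypothetical utility (for real projects, `packaging.version.Version` is more robust):

```python
# Sketch: gate startup on known-good dependency ranges.
# Version pins here are PLACEHOLDERS, not official requirements.
import sys

def parse(v):
    # Crude parse of the leading numeric components ("1.10.0rc1" -> (1, 10, 0));
    # good enough for a sanity gate, not a full PEP 440 parser.
    parts = []
    for piece in v.split("."):
        num = ""
        for ch in piece:
            if ch.isdigit():
                num += ch
            else:
                break
        if not num:
            break
        parts.append(int(num))
    return tuple(parts)

assert sys.version_info >= (3, 9), "Python too old (placeholder minimum)"
assert parse("2.7.1") >= (2, 0), "pydantic v1 detected"  # e.g. pydantic.__version__
print("versions OK")
```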