Diagnose training issues with Tinker — slow steps, hanging sessions, output mismatches, error messages, renderer problems, and deployment issues. Use this skill whenever a user reports that training is slow, steps take too long, sessions are hanging, model outputs differ between Tinker and external engines (vLLM, SGLang), they get a confusing error message, training quality is poor (high KL, bad outputs), or they suspect something is wrong. Also trigger when users ask "is this a Tinker issue or my issue?", "is Tinker down?", report unexpected wait times, see output quality regressions, get opaque errors, or want to profile/debug their training or deployment pipeline. This skill walks through systematic triage to determine root cause.
Systematic triage for training and deployment issues, organized into five triage paths. Identify which category the user's problem falls into, then follow the appropriate triage path.
Understanding the SDK's threading model is key to diagnosing most issues. The SDK runs a background thread with its own asyncio event loop. All network I/O, heartbeats, and API result polling happen on this thread.
┌─────────────────────┐ ┌──────────────────────────────┐
│ Main Thread │ │ SDK Background Thread │
│ (user code) │ │ (asyncio event loop) │
│ │ │ │
│ fb = tc.fwd_bwd_ │────>│ HTTP POST /forward_backward │
│ async(data) │ │ → returns request_id │
│ │ │ │
│ # prepare next │ │ Long-poll /retrieve_future │
│ # batch here... │ │ (HTTP 408 = not ready yet) │
│ │ │ │
│ result = fb.result()│<────│ Result arrives → resolve │
│ # blocks until done │ │ │
│ │ │ Heartbeat every 10s │
└─────────────────────┘ └──────────────────────────────┘
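The pattern in the diagram can be sketched in a few lines of standard-library Python. This is an assumed reconstruction of the general technique, not the actual Tinker SDK source: a daemon thread owns an asyncio event loop, and the main thread submits coroutines to it with `run_coroutine_threadsafe`. The `fake_api_call` coroutine is a stand-in for the SDK's network I/O.

```python
# Sketch (assumed, not actual SDK code): a background daemon thread
# runs its own asyncio loop; the main thread submits work to it.
import asyncio
import threading

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def fake_api_call(x):
    await asyncio.sleep(0.01)  # stands in for an HTTP round-trip
    return x + 1

# run_coroutine_threadsafe returns a concurrent.futures.Future that
# the main thread can block on with .result()
fut = asyncio.run_coroutine_threadsafe(fake_api_call(41), loop)
print(fut.result())  # → 42
```

The key property this sketch shares with the SDK is that the main thread never runs the event loop itself; it only submits work and blocks on futures, which is why GIL contention on the main thread can starve the loop.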
When you call forward_backward_async(), the SDK:
1. Sends an HTTP POST to /forward_backward, which returns a request_id immediately
2. Long-polls /retrieve_future on the background thread (HTTP 408 means "not ready yet")
3. Resolves the returned future once the result arrives

Calling .result() on the future blocks the main thread until the background thread resolves it.

Why this matters for debugging: The background thread shares the Python GIL with the main thread. If user code holds the GIL for extended periods (heavy numpy/torch computation, CPU-bound data processing, slow serialization), the background thread cannot poll for results, send heartbeats, or perform any other network I/O.

This means "my training is slow/hanging" is often caused by the user's own code blocking the SDK's background thread via GIL contention, not by a network or server issue.
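The "prepare next batch while the request is in flight" overlap from the diagram can be demonstrated with a plain `concurrent.futures.Future`. The `fake_forward_backward` function is hypothetical, a stand-in for the SDK call that resolves a future from a background thread; the point is the ordering, not the API.

```python
# Sketch of the submit-then-overlap pattern: do useful main-thread
# work between submitting a request and blocking on its result.
import threading
import time
from concurrent.futures import Future

def fake_forward_backward(batch):
    # Hypothetical stand-in for the SDK: resolves a Future on a
    # background thread after a simulated network round-trip.
    fut = Future()
    def worker():
        time.sleep(0.05)  # simulated server-side work
        fut.set_result(sum(batch))
    threading.Thread(target=worker, daemon=True).start()
    return fut

fut = fake_forward_backward([1, 2, 3])
next_batch = [x * 2 for x in [1, 2, 3]]  # prepare next batch while in flight
result = fut.result()  # blocks until the background thread resolves it
print(result)  # → 6
```

The caveat from the triage above applies here: if the "prepare next batch" step is CPU-bound pure Python, it holds the GIL and can delay the background thread's polling, so keep that work light or move it into subprocesses.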
Work through these steps in order. Most issues are caught in steps 1-3 and never need deep profiling.
Bad dependency versions are a silent killer. Check these first because they're fast to verify and cause mysterious slowdowns that look like service issues.
import sys, pydantic, tinker
print(f"Python: {sys.version}")
print(f"pydantic: {pydantic.__version__}")
print(f"tinker SDK: {tinker.__version__}")
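Beyond printing versions, it can help to fail fast with explicit assertions. The minimum versions below are placeholders for illustration, not official Tinker requirements, and the crude `parse` helper is a hypothetical utility (for real projects, `packaging.version.Version` is more robust):

```python
# Sketch: gate startup on known-good dependency ranges.
# Version pins here are PLACEHOLDERS, not official requirements.
import sys

def parse(v):
    # Crude parse of the leading numeric components ("1.10.0rc1" -> (1, 10, 0));
    # good enough for a sanity gate, not a full PEP 440 parser.
    parts = []
    for piece in v.split("."):
        num = ""
        for ch in piece:
            if ch.isdigit():
                num += ch
            else:
                break
        if not num:
            break
        parts.append(int(num))
    return tuple(parts)

assert sys.version_info >= (3, 9), "Python too old (placeholder minimum)"
assert parse("2.7.1") >= (2, 0), "pydantic v1 detected"  # e.g. pydantic.__version__
print("versions OK")
```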