Name: Sir Convert A Lot Qwen Finetuning
Author: paunchygent

SkillsPool

Buscar habilidades.../

Sir Convert A Lot Qwen Finetuning | Skills Pool

The main project target is full fine-tuning of the 1.7B base model.
Hemma is viable for bounded pilot work.
Colab H100 is the scale-up lane, not the only viable lane.
The end goal is general Swedish support, not a single teacher voice.
The first bounded Task 101 Hemma pilot must consume a deterministic training bundle projected from the frozen pilot root, not the generic promoted Task 103 preprocessing root.
The canonical repo-owned materialization surface for that bundle is:
- pdm run task-101-pilot-bundle build
The detached Task 101 runtime must record both the train and held-out eval manifest paths in launch/status/report metadata while staying explicit that upstream sft_12hz.py is still train-only and does not perform in-training evaluation.
Scheduled Task 101 runs now use the canonical 500/100/3 posture: durable checkpoint every 500 optimizer steps, held-out eval every 100 steps, retain newest 3 durable trainer-state checkpoints.
For older pre-schedule checkpoints, the canonical recovery order is: standalone held-out eval first, then resume only if the saved cursor is compatible with the current bundle contract.
If a legacy launch requires --pilot-bundle-root override, do not assume the saved intra-epoch cursor is still meaningful; treat any impossible cursor as a fail-closed condition, not a warning.
If a bounded recovery probe already produced a newer durable checkpoint with a compatible cursor, prefer that newer checkpoint for the next strict resume rather than resetting to the older legacy step.
If a preserved legacy launch still carries stale checkpoint cadence or retention values, pass explicit resume overrides so the relaunched lane truthfully matches the current 500/100/3 scheduled posture.
If a resumed Task 101 lane fails with repeated non-finite behavior, do not keep retrying blind full training runs. The canonical next step is: status -> diagnose-non-finite -> fix -> bounded retry.
Story 32 is now the governing protocol for active Qwen Task 101 experiment work:
- classify every active run as provenance, mechanism, or recovery
- keep one question per run
- record the full state vector in the Task 101 progress ledger before making causal claims
- use the promotion ladder: local gate -> short bounded fresh-start run -> longer governed proof
Current active surface matrix:
- qwen-t221-historical-control: provenance
- qwen-story31-stability-lab: mechanism
- governed qwen-train launch/status fresh-start proof lane: recovery, blocked until promotion
- qwen-story30-freshstart-proof and qwen-story30-backward-lineage: legacy-readonly
- qwen-t197-proof and qwen-t198-proof: deprecated for new work
Current operator truth:
- T221 is now resolved as negative recreated-control evidence: the recreated original-recipe shape plus only the T206 token-span fix still fails immediately under the current trainer/runtime
- treat that as provenance evidence only, not as a mechanism or recovery answer
- Story 31 remains the active mechanism lane
- T225 is complete as the exact parity contract
- T226 is now complete as the committed local parity-probe surface: pdm run qwen-story31-parity-probe run
- the live in-image historical-bundle run under task226-20260317t224307Z found no meaningful checkpoint divergence between the current and intended paths
- T219 is now recorded as negative bounded evidence under task219-20260317t180700z-a1
- T228 is now complete as the ranked closure of that family
- T229 is now complete as the narrowed rerun under task229-20260318t064712z-a1
- the target sub_talker_loss family localizes to talker_core.layer_16.input_layernorm
- T230 is now complete as the negative bounded normalization-entry rerun under task230-20260318t082049z-a1
- T231 is now complete as the explicit no-winner promotion decision
- T232 is now complete as the lane decision to stay in mechanism
- T233 is now complete as the normalization-internal rerun under task233-20260318t112544z-a1
- the first verified internal surface is now talker_core.layer_16.input_layernorm.output
- T234 is now complete under task234-20260318t123644z-a1
- no variant stayed finite or earned promotion; the strongest 0p5 member shifted the pair and line-13 sub_talker_loss cases to talker_core.layer_15.output, while line-4 still first broke at talker_core.layer_16.input_layernorm
- T235 is now complete under task235-20260318t140352z-a1
- the mixed sub_talker_loss result is repeatable: pair and line-13 stay at talker_core.layer_15.output, while line-4 stays at talker_core.layer_16.input_layernorm
- T236 is now complete under
Story 28 / T187-T191 is the permanent anti-god-file architecture lane for the Qwen training control plane and is now delivered. Keep new host-side logic in ml/qwen/training/control_plane/, detached launch logic in ml/qwen/training/detached_runtime/, reporting logic in ml/qwen/training/reporting/, and patched runtime logic in the bounded sft_12hz_* runtime modules. orchestrator.py and reporting.py are gone and must not be reintroduced.
Do not trust reused-run status.json or report.json artifacts unless they clearly belong to the active resumed container.
Intentional detached Task 101 stops now request graceful shutdown and one final durable checkpoint when progress advanced beyond the latest saved step.
Task 100/101 launch surfaces now emit an explicit BuildKit cold-build warning before heavy image compilation begins.
Any future production use must still fit the sidecar-only architecture from ADR-0006 and ADR-0007.
Long-running Hemma preprocessing, training, and corpus-acquisition work must run detached from the local client session.
When Colab runs persist status, logs, or spool JSON into Google Drive and the Drive connector is authenticated, inspect those artifacts directly before asking the user to run notebook-side status commands.
When the user provides a direct Drive link, prefer direct id-based metadata lookup before Drive search. Search can miss artifacts that are plainly reachable by id.
Hemma storage tiers are fixed:
- /srv/scratch for Docker root, HF/model caches, and hot generated preprocessing/training artifacts
- /srv/storage for raw Swedish corpora and colder retained datasets
For Qwen Docker workloads on Hemma, treat the storage and bind contracts as separate truths:
- /srv/scratch/sir-convert-a-lot/{build,cache} remains the canonical SSD storage truth
- /home/paunchygent/.data/sir-convert-a-lot/{build,cache} is the normal Docker-visible bind source under snap Docker
- run pdm run run-hemma -- pdm run qwen-docker-bind-roots status
- run pdm run run-hemma -- pdm run qwen-docker-bind-roots probe
- dynamic runtime bind fallback is compatibility-only after T242

task236-20260318t145434z-a1

Sir Convert A Lot Qwen Finetuning

Use This Skill When

Do Not Use This Skill For

Source of Truth

Sir Convert A Lot Qwen Finetuning

Use This Skill When

Do Not Use This Skill For

Source of Truth

First Move

Core Project Position

Qwen-Specific Decisions

Flash Attention Rules

Dataset Strategy

Qwen Workflow Overlay

Common Failure Modes

Promotion Rule

Continuous Learning V2

Continuous Learning V2

Continuous Learning V2

Continuous Learning

Continuous Learning

Pytorch Patterns