Debug and operate the interpretune self-hosted Azure GPU pipeline, including PAT-backed approval release, queue triage, worker dispatch checks, phase-split test diagnosis, and memory-aware fixture narrowing.
Use this skill when an interpretune self-hosted GPU Azure DevOps run is queued, failing, or suspected to be exhausting memory.
Context:
- Runs can sit in notStarted after a PR becomes ready for review; the agent may receive a shutdown signal or otherwise fail under load.
- The pipeline is defined in azure-pipelines/gpu-tests.yml. The default pool is the self-hosted pool, but a queued build may still show queue.name = Azure Pipelines at the build level.
- AZURE_DEVOPS_EXT_PAT is the preferred non-interactive authentication path for az devops and Azure DevOps REST calls.
- Organization: speediedan; project: interpretune.
- The agent service is constrained with systemd MemoryMax/MemoryHigh and a low OOMScoreAdjust.
- Test phases:
  - Testing: standard is CPU-only with CUDA_VISIBLE_DEVICES=''.
  - Testing: standard gpu cuda-marked runs regular CUDA-gated tests under IT_RUN_CUDA_TESTS=1.
  - Testing: standalone gpu runs standalone GPU tests.
  - Testing: CI Profiling runs profiling GPU tests.

Confirm the PAT is exported:

printenv AZURE_DEVOPS_EXT_PAT | wc -c
az pipelines build show --id <build_id> --organization https://dev.azure.com/speediedan --project interpretune -o table
curl -sS -u ":${AZURE_DEVOPS_EXT_PAT}" \
"https://dev.azure.com/speediedan/interpretune/_apis/pipelines/approvals?state=pending&api-version=7.1-preview.1"
Interpretation:
- If the build is notStarted and approvals are pending, approve the run before touching the agent.

Approve the pending gate directly from the shell:
curl -sS -X PATCH -u ":${AZURE_DEVOPS_EXT_PAT}" \
-H "Content-Type: application/json" \
-d '[{"approvalId":"<approval_id>","status":"approved","comment":"Approved via CLI for self-hosted GPU validation."}]' \
"https://dev.azure.com/speediedan/interpretune/_apis/pipelines/approvals?api-version=7.1-preview.1"
watch -n 30 'az pipelines build show --id <build_id> --organization https://dev.azure.com/speediedan --project interpretune --query "{status:status,result:result,startTime:startTime,finishTime:finishTime}" -o json'
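Where watch is unavailable (e.g. minimal containers), a plain polling loop does the same job; <build_id> is a placeholder as elsewhere in this runbook:

```shell
# Poll every 30 s until the build leaves notStarted.
while :; do
  status=$(az pipelines build show --id <build_id> \
    --organization https://dev.azure.com/speediedan --project interpretune \
    --query status -o tsv)
  echo "status: $status"
  [ "$status" = "notStarted" ] || break
  sleep 30
done
```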
tail -f /opt/az_pipeline_agent/_diag/Agent_*.log
ls -1t /opt/az_pipeline_agent/_diag/Worker_*.log | head
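A quick staleness check on the worker logs can distinguish "worker never dispatched" from "worker running slowly". A sketch (the 10-minute window is an arbitrary assumption, tune to your job cadence):

```shell
# If no Worker_*.log changed recently while a build is in progress,
# the worker likely never picked up the job.
recent=$(find /opt/az_pipeline_agent/_diag -name 'Worker_*.log' -mmin -10 | wc -l)
if [ "$recent" -eq 0 ]; then
  echo "no recent worker activity; check agent service health" >&2
fi
```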
az pipelines agent list --organization https://dev.azure.com/speediedan --pool-id 1 -o table
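To turn the agent listing into a pass/fail signal, the JSON output can be filtered for online, enabled agents. A sketch (the python filter works on the JSON shape returned by `az pipelines agent list -o json`):

```shell
# Count online, enabled agents in pool 1; 0 means the queue cannot drain.
az pipelines agent list --organization https://dev.azure.com/speediedan \
  --pool-id 1 -o json | python3 -c '
import json, sys
agents = json.load(sys.stdin)
online = [a for a in agents if a.get("status") == "online" and a.get("enabled")]
print(len(online))
'
```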
Interpretation:
- Build stays notStarted. Action: release any pending approval and confirm the pool has an online, enabled agent.
- No worker dispatch appears. Action: check the newest Worker_*.log and the agent _diag logs for dispatch errors.
- Container jobs fail at startup. Action: verify /var/run/docker.sock symlink handling and agent service health.
- Failures are isolated to the IT_RUN_CUDA_TESTS=1, standalone, or profile_ci slices. Action: use the local Azure reproduction flow in distributed-insight to recreate the containerized runner context. Start with the same phase that failed remotely.
Useful commands:
CUDA_VISIBLE_DEVICES='' python -m pytest --cov=src/interpretune --cov-append --cov-report= src/interpretune tests -v --reruns 2 --reruns-delay 5
IT_RUN_CUDA_TESTS=1 python -m pytest --cov=src/interpretune --cov-append --cov-report= tests -v --durations=50 --reruns 2 --reruns-delay 5
bash ./tests/special_tests.sh --mark_type=standalone
bash ./tests/special_tests.sh --mark_type=profile_ci
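When diagnosing memory pressure, bracketing each phase with a MemAvailable reading makes the OOM-prone slice easy to spot. A Linux-only sketch reading /proc/meminfo (the wrapper name is illustrative):

```shell
# Report available system memory before and after a phase command.
mem_avail_kb() { awk '/MemAvailable/ {print $2}' /proc/meminfo; }
run_phase_with_mem() {
  before=$(mem_avail_kb)
  "$@"
  rc=$?
  after=$(mem_avail_kb)
  echo "MemAvailable kB: before=$before after=$after (rc=$rc)"
  return $rc
}
# Example: run_phase_with_mem bash ./tests/special_tests.sh --mark_type=standalone
```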
When CUDA tests still carry too much memory:
- Read fixture_usage.instructions.md before changing test fixtures.
- Use analysis_fixture_scope() from tests/analysis_resource_utils.py so low-RAM runners can fall back to function scope while higher-RAM runners keep class reuse.
- Prefer AnalysisExtractionMixin and declarative AnalysisFixtureSpec entries over parity-local extraction helpers.
- A job-level queue.name of Azure Pipelines means the YAML job pool changed.

After following this skill you should know: