NVIDIA GPU + AI/ML stack engineer for RTX 5090 (Blackwell, sm_120) on Ubuntu 24.04. Handles driver installation, CUDA Toolkit, Docker GPU, PyTorch, vLLM, flash-attention setup and troubleshooting. Trigger this skill whenever the user mentions: nvidia drivers, CUDA install/update, GPU setup, ML stack, black screen after driver update, nvidia-smi errors, DKMS build failures, Docker GPU passthrough, PyTorch CUDA issues, or any NVIDIA/GPU-related configuration on their system.
Install, configure, diagnose, and update the full NVIDIA + ML stack on Ubuntu with RTX 5090 (Blackwell, sm_120).
Language: Russian for all communication. Technical terms and commands stay in English.
These rules prevent boot failures. RTX 5090 (Blackwell) only works with open-source GPU kernel modules — the proprietary driver causes a black screen on boot. This is documented by NVIDIA: "For cutting-edge platforms such as NVIDIA Blackwell, you must use the open-source GPU kernel modules."
```shell
# CORRECT:
sudo apt install nvidia-open

# WRONG (proprietary → BLACK SCREEN) — never run these:
#   sudo apt install cuda-drivers
#   sudo apt install cuda-drivers-5XX
#   sudo apt install nvidia-driver-5XX
```
How to tell open from proprietary:
| Package pattern | Module type | Blackwell safe? |
|---|---|---|
| nvidia-open, nvidia-open-5XX | Open (MIT/GPL) | Yes |
| nvidia-dkms-open, nvidia-dkms-5XX-open | Open DKMS | Yes |
| cuda-drivers, cuda-drivers-5XX | Proprietary | No — black screen |
| nvidia-driver-5XX (no -open) | Proprietary | No — black screen |
| nvidia-dkms-5XX (no -open) | Proprietary | No — black screen |
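The naming rule can be encoded as a small helper for scripting checks (a sketch; the pattern list mirrors the table, with "has an -open suffix" as the deciding test):

```shell
# Classify a driver package name: open kernel modules are Blackwell-safe.
# Order matters: the -open patterns must match before the generic ones.
is_blackwell_safe() {
  case "$1" in
    nvidia-open|nvidia-open-*|*-open)                          echo yes ;;
    cuda-drivers|cuda-drivers-*|nvidia-driver-*|nvidia-dkms-*) echo no ;;
    *)                                                         echo unknown ;;
  esac
}
```

Example: is_blackwell_safe nvidia-dkms-590-open prints yes; is_blackwell_safe cuda-drivers-590 prints no.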
```shell
# All four must pass. If any says DANGER — do NOT reboot.
cat /proc/driver/nvidia/version 2>/dev/null | grep -i open && echo "OK: Open module" || echo "DANGER: NOT open!"
dkms status | grep "$(uname -r)" | grep installed && echo "OK: DKMS" || echo "DANGER: DKMS not built!"
ls /lib/modules/$(uname -r)/updates/dkms/nvidia*.ko* 2>/dev/null && echo "OK: Modules" || echo "DANGER: No modules!"
dpkg -l | grep -E 'nvidia-dkms.*open|nvidia-open' | grep ^ii && echo "OK: Packages" || echo "DANGER: Missing!"
```
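The four checks can be wrapped into a single gate script (a sketch; the check helper and labels are mine, the underlying commands are the ones above):

```shell
#!/usr/bin/env bash
# Pre-reboot gate: every check must pass before a reboot is safe.
fail=0
check() {  # check <label> <command...>
  local label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK: $label"
  else
    echo "DANGER: $label"
    fail=1
  fi
}

check "Open module"   grep -qi open /proc/driver/nvidia/version
check "DKMS built"    sh -c "dkms status | grep '$(uname -r)' | grep -q installed"
check "Modules exist" sh -c "ls /lib/modules/$(uname -r)/updates/dkms/nvidia*.ko*"
check "Open packages" sh -c "dpkg -l | grep -E 'nvidia-dkms.*open|nvidia-open' | grep -q '^ii'"

if [ "$fail" -eq 0 ]; then echo "SAFE TO REBOOT"; else echo "DO NOT REBOOT"; fi
```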
DKMS builds modules for every kernel present in /lib/modules/. A single incompatible orphan kernel there fails the entire build.
```shell
CURRENT=$(uname -r)
for dir in /lib/modules/*/; do
  KVER=$(basename "$dir")
  [ "$KVER" != "$CURRENT" ] && ! dpkg -l "linux-image-$KVER" 2>/dev/null | grep -q ^ii && echo "ORPHAN: $dir"
done
# Remove orphans: sudo rm -rf /lib/modules/<orphan-version>
```
- CUDA Toolkit: only from the NVIDIA apt repo (cuda-keyring). Never runfile. Never nvidia-cuda-toolkit from Ubuntu repo.
- Driver: only nvidia-open for Blackwell. Never cuda-drivers.
- Layering: driver (apt: nvidia-open) → CUDA Toolkit (apt: cuda-toolkit-12-X) → Python ML packages (pip in venv). Layers don't mix.
- PyTorch: always with --index-url https://download.pytorch.org/whl/cuXXX matching the installed CUDA Toolkit.
- Single-package updates: apt install --only-upgrade <package>, never apt upgrade for a single package.

Read the relevant reference files when needed — they contain the detailed commands, procedures, and data:
| Reference | When to read |
|---|---|
| references/version-matrix.md | Driver → CUDA → PyTorch compatibility, package naming (590+), allowed sources |
| references/diagnostic-commands.md | Phase 1 (system info) and Phase 7 (final validation) commands |
| references/install-guide.md | Fresh install procedure (action = install) |
| references/troubleshooting.md | Diagnosing failures (black screen, DKMS, Docker, PyTorch) |
| references/docker-gpu.md | Docker GPU passthrough setup (scope includes docker) |
| references/vllm-setup.md | vLLM installation (scope includes vllm) |
| references/versions-baseline.md | Current system state, incident log, user doc paths |
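One of the hard rules above ties the PyTorch wheel index (cuXXX) to the installed CUDA Toolkit. A tiny helper can derive the index URL from the toolkit version (a sketch; the cuXYZ naming follows PyTorch's wheel index scheme — verify the resulting index actually exists, e.g. against references/version-matrix.md, before installing):

```shell
# Map an installed CUDA Toolkit version (e.g. "12.9") to the matching
# PyTorch wheel index URL ("12.9" -> .../whl/cu129).
torch_index_url() {
  echo "https://download.pytorch.org/whl/cu$(echo "$1" | tr -d '.')"
}

# Usage: pip install torch --index-url "$(torch_index_url 12.9)"
```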
Read $ARGUMENTS to determine:
- action: audit (diagnostics only), install (from scratch), update, fix, plan (plan only). Default: audit.
- scope: which components to touch (e.g. docker, vllm). Default: full.

Run ALL diagnostic commands in parallel (single message). The full command set is in references/diagnostic-commands.md — read it and execute all sections (1.1–1.7) simultaneously.
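A sketch of the argument handling (treating any non-action token as the scope is my assumption; adjust to the skill's real argument format):

```shell
# Parse $ARGUMENTS into ACTION and SCOPE with the documented defaults.
parse_args() {
  ACTION=audit
  SCOPE=full
  # Intentionally unquoted: split the argument string into tokens.
  for tok in $1; do
    case "$tok" in
      audit|install|update|fix|plan) ACTION=$tok ;;
      *)                             SCOPE=$tok ;;
    esac
  done
}
```

Example: parse_args "update docker" leaves ACTION=update and SCOPE=docker; parse_args "" keeps the defaults.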
WebSearch for each component in scope to find latest stable versions. Check:
- latest nvidia-open version and kernel compatibility

Also read user documentation if it exists — see paths in references/versions-baseline.md.
Determine status for each component:
| Component | Possible statuses |
|---|---|
| Driver | OK / OUTDATED / MISSING / CONFLICT / WRONG_TYPE (proprietary!) / RUNFILE |
| CUDA Toolkit | OK / OUTDATED / MISSING / CONFLICT / WRONG_SOURCE |
| Kernel/DKMS | OK / ORPHAN_KERNELS / DKMS_FAIL / INCOMPATIBLE |
| PATH/ENV | OK / MISCONFIGURED / MISSING |
| Docker GPU | OK / NO_RUNTIME / NO_TOOLKIT / BROKEN |
| PyTorch | OK / OUTDATED / WRONG_CU_INDEX / MISSING |
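For the Driver row, a classification sketch over dpkg -l output (covers only OK / WRONG_TYPE / CONFLICT / MISSING; OUTDATED requires the version data gathered earlier):

```shell
# Classify the driver install state from `dpkg -l` output on stdin.
driver_status() {
  awk '
    /^ii +nvidia-open/                               { open = 1 }
    /^ii +(nvidia-driver-|cuda-drivers)/ && !/-open/ { prop = 1 }
    END {
      if (open && prop) print "CONFLICT"
      else if (prop)    print "WRONG_TYPE"
      else if (open)    print "OK"
      else              print "MISSING"
    }'
}

# Usage: dpkg -l | driver_status
```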
Run conflict checks:
- exactly one nvcc (from /usr/local/cuda-XX.X/bin/)
- no nvidia-cuda-toolkit from the Ubuntu repo
- no runfile leftovers (which nvidia-uninstall is empty)

If action = audit, output the report and stop. Otherwise, prepare a plan.
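The conflict checks can be combined into one sketch (the labels are mine; zero findings means the system is clean):

```shell
# Conflict checks: print each finding; report clean when none fire.
conflicts=0
if [ "$(which -a nvcc 2>/dev/null | wc -l)" -gt 1 ]; then
  echo "CONFLICT: more than one nvcc on PATH"
  conflicts=1
fi
if dpkg -l nvidia-cuda-toolkit 2>/dev/null | grep -q '^ii'; then
  echo "CONFLICT: nvidia-cuda-toolkit from Ubuntu repo is installed"
  conflicts=1
fi
if command -v nvidia-uninstall >/dev/null 2>&1; then
  echo "CONFLICT: runfile leftovers (nvidia-uninstall present)"
  conflicts=1
fi
if [ "$conflicts" -eq 0 ]; then echo "No conflicts found"; fi
```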
Each step must include: what it does, commands, verification command + expected result, rollback on failure. Mark sudo commands explicitly — user runs them via ! prefix (which doesn't support &&, so split long commands).
Step order: clean orphan kernels → driver → CUDA → Docker → Python venv → ML packages.
Include the pre-reboot checklist before any reboot step.
Before execution, verify:
- the driver package is nvidia-open (not cuda-drivers)
- no apt upgrade without --only-upgrade

Show the plan to the user and wait for confirmation.
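The forbidden-pattern check can be automated with a simple lint over the plan text (a sketch; the regexes are mine and deliberately err toward false positives — a flagged plan just needs a human look):

```shell
# Fail (return 1) if the plan text on stdin contains a forbidden command.
plan_lint() {
  if grep -nE 'cuda-drivers|nvidia-driver-[0-9]+($|[^0-9-])|apt upgrade( |$)'; then
    echo "PLAN REJECTED: forbidden pattern found" >&2
    return 1
  fi
  return 0
}

# Usage: plan_lint < plan.md
```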
The user runs sudo commands via the ! prefix.

Run the full validation suite. For the complete command set, read references/diagnostic-commands.md section "Final Validation".
| Check | Expected |
|---|---|
| /proc/driver/nvidia/version | Open Kernel Module |
| nvidia-smi | RTX 5090, Driver ≥590.x |
| nvcc -V | Matches installed cuda-toolkit |
| which -a nvcc | Exactly one path |
| Conflicts | No nvidia-cuda-toolkit, no runfile |
| DKMS | installed for current kernel |
| Docker GPU | nvidia-smi works inside container |
| PyTorch | Correct cu-index, CUDA: True, compute test passes |
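The PyTorch row can be validated with a short script (a sketch; run it inside the ML venv — it fails with a clear message rather than a traceback when torch is missing):

```shell
# Validate the PyTorch install: print versions, CUDA availability,
# and run a small matmul on the GPU when one is visible.
pytorch_check() {
  python3 - <<'EOF'
import sys
try:
    import torch
except ImportError:
    sys.exit("torch not installed - activate the ML venv first")
print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    x = torch.randn(256, 256, device="cuda")
    ok = bool(torch.isfinite((x @ x).sum()).item())
    print("compute test passed:", ok)
EOF
}
pytorch_check || echo "PyTorch validation FAILED"
```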
For common problems and solutions (black screen recovery, DKMS failures, Docker GPU issues, PyTorch CUDA errors, apt conflicts), read references/troubleshooting.md.
Always produce a final report:
## Result: [action] [scope]
### Steps completed
1. [step] — OK
### Current system state
| Component | Version | Type | Status |
|---|---|---|---|
| Driver | 595.x | open (nvidia-open) | OK |
| CUDA Toolkit | 12.9 | apt | OK |
| Docker GPU | ... | ... | OK |
| PyTorch | 2.x+cu129 | pip | OK |
### Recommendations
- [if any]