NVIDIA GPU + AI/ML stack engineer for RTX 5090 (Blackwell, sm_120) on Ubuntu 24.04. Handles driver installation, CUDA Toolkit, Docker GPU, PyTorch, vLLM, flash-attention setup and troubleshooting. Trigger this skill whenever the user mentions: nvidia drivers, CUDA install/update, GPU setup, ML stack, black screen after driver update, nvidia-smi errors, DKMS build failures, Docker GPU passthrough, PyTorch CUDA issues, or any NVIDIA/GPU-related configuration on their system.
Install, configure, diagnose, and update the full NVIDIA + ML stack on Ubuntu with RTX 5090 (Blackwell, sm_120).
Language: Russian for all communication. Technical terms and commands stay in English.
These rules prevent boot failures. RTX 5090 (Blackwell) only works with open-source GPU kernel modules — the proprietary driver causes a black screen on boot. This is documented by NVIDIA: "For cutting-edge platforms such as NVIDIA Blackwell, you must use the open-source GPU kernel modules."
```shell
# CORRECT:
sudo apt install nvidia-open

# WRONG (proprietary → BLACK SCREEN) — never run these:
#   sudo apt install cuda-drivers
#   sudo apt install cuda-drivers-5XX
#   sudo apt install nvidia-driver-5XX
```
How to tell open from proprietary:
| Package pattern | Module type | Blackwell safe? |
|---|---|---|
| nvidia-open, nvidia-open-5XX | Open (MIT/GPL) | Yes |
| nvidia-dkms-open, nvidia-dkms-5XX-open | Open DKMS | Yes |
| cuda-drivers, cuda-drivers-5XX | Proprietary | No — black screen |
| nvidia-driver-5XX (no -open) | Proprietary | No — black screen |
| nvidia-dkms-5XX (no -open) | Proprietary | No — black screen |
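The naming rule can be encoded as a small helper for scripting checks (a sketch; the pattern list mirrors the table, with "has an -open suffix" as the deciding test):

```shell
# Classify a driver package name: open kernel modules are Blackwell-safe.
# Order matters: the -open patterns must match before the generic ones.
is_blackwell_safe() {
  case "$1" in
    nvidia-open|nvidia-open-*|*-open)                          echo yes ;;
    cuda-drivers|cuda-drivers-*|nvidia-driver-*|nvidia-dkms-*) echo no ;;
    *)                                                         echo unknown ;;
  esac
}
```

Example: is_blackwell_safe nvidia-dkms-590-open prints yes; is_blackwell_safe cuda-drivers-590 prints no.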
```shell
# All four must pass. If any says DANGER — do NOT reboot.
cat /proc/driver/nvidia/version 2>/dev/null | grep -i open && echo "OK: Open module" || echo "DANGER: NOT open!"
dkms status | grep "$(uname -r)" | grep installed && echo "OK: DKMS" || echo "DANGER: DKMS not built!"
ls /lib/modules/$(uname -r)/updates/dkms/nvidia*.ko* 2>/dev/null && echo "OK: Modules" || echo "DANGER: No modules!"
dpkg -l | grep -E 'nvidia-dkms.*open|nvidia-open' | grep ^ii && echo "OK: Packages" || echo "DANGER: Missing!"
```
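The four checks can be wrapped into a single gate script (a sketch; the check helper and labels are mine, the underlying commands are the ones above):

```shell
#!/usr/bin/env bash
# Pre-reboot gate: every check must pass before a reboot is safe.
fail=0
check() {  # check <label> <command...>
  local label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK: $label"
  else
    echo "DANGER: $label"
    fail=1
  fi
}

check "Open module"   grep -qi open /proc/driver/nvidia/version
check "DKMS built"    sh -c "dkms status | grep '$(uname -r)' | grep -q installed"
check "Modules exist" sh -c "ls /lib/modules/$(uname -r)/updates/dkms/nvidia*.ko*"
check "Open packages" sh -c "dpkg -l | grep -E 'nvidia-dkms.*open|nvidia-open' | grep -q '^ii'"

if [ "$fail" -eq 0 ]; then echo "SAFE TO REBOOT"; else echo "DO NOT REBOOT"; fi
```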
DKMS builds modules for every kernel present in /lib/modules/. A single incompatible orphan kernel there fails the entire build.
```shell
CURRENT=$(uname -r)
for dir in /lib/modules/*/; do
  KVER=$(basename "$dir")
  [ "$KVER" != "$CURRENT" ] && ! dpkg -l "linux-image-$KVER" 2>/dev/null | grep -q ^ii && echo "ORPHAN: $dir"
done
# Remove orphans: sudo rm -rf /lib/modules/<orphan-version>
```
- CUDA Toolkit: only from the NVIDIA apt repo (cuda-keyring). Never runfile. Never nvidia-cuda-toolkit from Ubuntu repo.
- Driver: only nvidia-open for Blackwell. Never cuda-drivers.
- Layering: driver (apt: nvidia-open) → CUDA Toolkit (apt: cuda-toolkit-12-X) → Python ML packages (pip in venv). Layers don't mix.
- PyTorch: always with --index-url https://download.pytorch.org/whl/cuXXX matching the installed CUDA Toolkit.
- Single-package updates: apt install --only-upgrade <package>, never apt upgrade for a single package.

Read the relevant reference files when needed — they contain the detailed commands, procedures, and data:
| Reference | When to read |
|---|---|
| references/version-matrix.md | Driver → CUDA → PyTorch compatibility, package naming (590+), allowed sources |
| references/diagnostic-commands.md | Phase 1 (system info) and Phase 7 (final validation) commands |
| references/install-guide.md | Fresh install procedure (action = install) |
| references/troubleshooting.md | Diagnosing failures (black screen, DKMS, Docker, PyTorch) |
| references/docker-gpu.md | Docker GPU passthrough setup (scope includes docker) |
| references/vllm-setup.md | vLLM installation (scope includes vllm) |
| references/versions-baseline.md | Current system state, incident log, user doc paths |
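One of the hard rules above ties the PyTorch wheel index (cuXXX) to the installed CUDA Toolkit. A tiny helper can derive the index URL from the toolkit version (a sketch; the cuXYZ naming follows PyTorch's wheel index scheme — verify the resulting index actually exists, e.g. against references/version-matrix.md, before installing):

```shell
# Map an installed CUDA Toolkit version (e.g. "12.9") to the matching
# PyTorch wheel index URL ("12.9" -> .../whl/cu129).
torch_index_url() {
  echo "https://download.pytorch.org/whl/cu$(echo "$1" | tr -d '.')"
}

# Usage: pip install torch --index-url "$(torch_index_url 12.9)"
```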
Read $ARGUMENTS to determine:
- action: audit (diagnostics only), install (from scratch), update, fix, plan (plan only). Default: audit.
- scope: which components to touch (e.g. docker, vllm). Default: full.

Run ALL diagnostic commands in parallel (single message). The full command set is in references/diagnostic-commands.md — read it and execute all sections (1.1–1.7) simultaneously.
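A sketch of the argument handling (treating any non-action token as the scope is my assumption; adjust to the skill's real argument format):

```shell
# Parse $ARGUMENTS into ACTION and SCOPE with the documented defaults.
parse_args() {
  ACTION=audit
  SCOPE=full
  # Intentionally unquoted: split the argument string into tokens.
  for tok in $1; do
    case "$tok" in
      audit|install|update|fix|plan) ACTION=$tok ;;
      *)                             SCOPE=$tok ;;
    esac
  done
}
```

Example: parse_args "update docker" leaves ACTION=update and SCOPE=docker; parse_args "" keeps the defaults.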
WebSearch for each component in scope to find latest stable versions. Check:
- latest nvidia-open version and kernel compatibility

Also read user documentation if it exists — see paths in references/versions-baseline.md.
Determine status for each component:
| Component | Possible statuses |
|---|---|
| Driver | OK / OUTDATED / MISSING / CONFLICT / WRONG_TYPE (proprietary!) / RUNFILE |
| CUDA Toolkit | OK / OUTDATED / MISSING / CONFLICT / WRONG_SOURCE |
| Kernel/DKMS | OK / ORPHAN_KERNELS / DKMS_FAIL / INCOMPATIBLE |
| PATH/ENV | OK / MISCONFIGURED / MISSING |
| Docker GPU | OK / NO_RUNTIME / NO_TOOLKIT / BROKEN |
| PyTorch | OK / OUTDATED / WRONG_CU_INDEX / MISSING |
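For the Driver row, a classification sketch over dpkg -l output (covers only OK / WRONG_TYPE / CONFLICT / MISSING; OUTDATED requires the version data gathered earlier):

```shell
# Classify the driver install state from `dpkg -l` output on stdin.
driver_status() {
  awk '
    /^ii +nvidia-open/                               { open = 1 }
    /^ii +(nvidia-driver-|cuda-drivers)/ && !/-open/ { prop = 1 }
    END {
      if (open && prop) print "CONFLICT"
      else if (prop)    print "WRONG_TYPE"
      else if (open)    print "OK"
      else              print "MISSING"
    }'
}

# Usage: dpkg -l | driver_status
```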
Run conflict checks:
- exactly one nvcc (from /usr/local/cuda-XX.X/bin/)
- no nvidia-cuda-toolkit from the Ubuntu repo
- no runfile leftovers (which nvidia-uninstall is empty)

If action = audit, output the report and stop. Otherwise, prepare a plan.
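The conflict checks can be combined into one sketch (the labels are mine; zero findings means the system is clean):

```shell
# Conflict checks: print each finding; report clean when none fire.
conflicts=0
if [ "$(which -a nvcc 2>/dev/null | wc -l)" -gt 1 ]; then
  echo "CONFLICT: more than one nvcc on PATH"
  conflicts=1
fi
if dpkg -l nvidia-cuda-toolkit 2>/dev/null | grep -q '^ii'; then
  echo "CONFLICT: nvidia-cuda-toolkit from Ubuntu repo is installed"
  conflicts=1
fi
if command -v nvidia-uninstall >/dev/null 2>&1; then
  echo "CONFLICT: runfile leftovers (nvidia-uninstall present)"
  conflicts=1
fi
if [ "$conflicts" -eq 0 ]; then echo "No conflicts found"; fi
```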
Each step must include: what it does, commands, verification command + expected result, rollback on failure. Mark sudo commands explicitly — user runs them via ! prefix (which doesn't support &&, so split long commands).
Step order: clean orphan kernels → driver → CUDA → Docker → Python venv → ML packages.
Include the pre-reboot checklist before any reboot step.
Before execution, verify:
- the driver package is nvidia-open (not cuda-drivers)
- no apt upgrade without --only-upgrade

Show the plan to the user and wait for confirmation.
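The forbidden-pattern check can be automated with a simple lint over the plan text (a sketch; the regexes are mine and deliberately err toward false positives — a flagged plan just needs a human look):

```shell
# Fail (return 1) if the plan text on stdin contains a forbidden command.
plan_lint() {
  if grep -nE 'cuda-drivers|nvidia-driver-[0-9]+($|[^0-9-])|apt upgrade( |$)'; then
    echo "PLAN REJECTED: forbidden pattern found" >&2
    return 1
  fi
  return 0
}

# Usage: plan_lint < plan.md
```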
The user runs sudo commands via the ! prefix.

Run the full validation suite. For the complete command set, read references/diagnostic-commands.md section "Final Validation".
| Check | Expected |
|---|---|
| /proc/driver/nvidia/version | Open Kernel Module |
| nvidia-smi | RTX 5090, Driver ≥590.x |
| nvcc -V | Matches installed cuda-toolkit |
| which -a nvcc | Exactly one path |
| Conflicts | No nvidia-cuda-toolkit, no runfile |
| DKMS | installed for current kernel |
| Docker GPU | nvidia-smi works inside container |
| PyTorch | Correct cu-index, CUDA: True, compute test passes |
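The PyTorch row can be validated with a short script (a sketch; run it inside the ML venv — it fails with a clear message rather than a traceback when torch is missing):

```shell
# Validate the PyTorch install: print versions, CUDA availability,
# and run a small matmul on the GPU when one is visible.
pytorch_check() {
  python3 - <<'EOF'
import sys
try:
    import torch
except ImportError:
    sys.exit("torch not installed - activate the ML venv first")
print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    x = torch.randn(256, 256, device="cuda")
    ok = bool(torch.isfinite((x @ x).sum()).item())
    print("compute test passed:", ok)
EOF
}
pytorch_check || echo "PyTorch validation FAILED"
```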
For common problems and solutions (black screen recovery, DKMS failures, Docker GPU issues, PyTorch CUDA errors, apt conflicts), read references/troubleshooting.md.
Always produce a final report:
## Result: [action] [scope]
### Steps completed
1. [step] — OK
### Current system state
| Component | Version | Type | Status |
|---|---|---|---|
| Driver | 595.x | open (nvidia-open) | OK |
| CUDA Toolkit | 12.9 | apt | OK |
| Docker GPU | ... | ... | OK |
| PyTorch | 2.x+cu129 | pip | OK |
### Recommendations
- [if any]