Summary
- Running Ollama with CUDA on a hybrid Intel + NVIDIA laptop causes the NVIDIA GPU to enter a fatal state.
- After failure, nvidia-smi reports no usable device until a full reboot.
- This is reproducible with an Ollama workload and is not fixed by restarting Ollama alone.
Environment (sanitized)
- OS: Ubuntu 24.04.4 LTS
- Kernel: 6.17.0-14-generic
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU (AD107M)
- PRIME mode: on-demand
- Driver stack: nvidia-driver-570-open 570.211.01
- Relevant module setting: NVreg_DynamicPowerManagement=0x02 (enabled by default runtime PM config)
- Ollama client/server: 0.15.2
Observed behavior
- Ollama uses CUDA normally at first.
- During inference, Ollama logs CUDA launch failure.
- Kernel logs Xid 79 and “GPU has fallen off the bus”.
- Driver escalates to Xid 154 and reports reboot required.
- Afterwards:
- nvidia-smi => "Unable to determine the device handle … Unknown Error / No devices were found"
- Live reset attempts do not recover the GPU
- Only a full reboot restores operation
Key evidence
- Ollama log:
  - "CUDA error: unspecified launch failure"
  - "ggml_cuda_init: failed to initialize CUDA: unknown error" (after crash)
- Kernel log:
  - "NVRM: Xid (PCI:0000:01:00): 79 … GPU has fallen off the bus."
  - "NVRM: Xid (PCI:0000:01:00): 154 … Node Reboot Required"
  - "NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover."
- PCI state after failure:
  - lspci -vv -s 01:00.0 shows "Unknown header type 7f"
  - GPU remains present at its bus ID but is not manageable by NVML/CUDA
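The fatal Xid lines above can be pulled out of the kernel log with a simple filter. A minimal sketch; the sample text below mirrors the messages in this report (in practice, pipe `dmesg` or `journalctl -k` into the same pattern):

```shell
# Filter kernel log lines for the fatal Xid codes seen here
# (79 = GPU has fallen off the bus, 154 = node reboot required).
# sample_log stands in for real dmesg output.
sample_log='NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
NVRM: Xid (PCI:0000:01:00): 154, Node Reboot Required'
printf '%s\n' "$sample_log" | grep -E 'NVRM: Xid \(PCI:[^)]+\): (79|154)'
```

On the affected machine, `sudo dmesg | grep -E 'NVRM: Xid'` with the same pattern surfaces both events.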
Additional signal
- Repeated ACPI warning around resume events:
- “ACPI Error: No handler or method for GPE 6B, disabling event”
- Hybrid + runtime power management path is active; failure may be in driver/firmware power-state handling under CUDA load.
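Since the suspected path is runtime power management, the dGPU's runtime PM state can be checked directly via sysfs. A hedged sketch; the PCI address 0000:01:00.0 is taken from the Xid lines above and may differ on other machines:

```shell
# Read the dGPU's runtime PM state from standard Linux sysfs attributes.
# runtime_status is typically "active" or "suspended"; control is "auto"
# when runtime PM is enabled (as with NVreg_DynamicPowerManagement=0x02).
dev=/sys/bus/pci/devices/0000:01:00.0
for f in power/runtime_status power/control; do
  if [ -r "$dev/$f" ]; then
    printf '%s: %s\n' "$f" "$(cat "$dev/$f")"
  else
    printf '%s: not present on this machine\n' "$f"
  fi
done
```

Seeing `runtime_status: suspended` flip to `active` around the crash window would support the power-state-transition theory.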
Reproduction (minimal)
- Boot system normally with nvidia-driver-570-open and PRIME on-demand.
- Start Ollama and run a CUDA-backed inference request.
- Repeat requests until failure occurs.
- Observe Xid 79 / Xid 154 in kernel log and loss of GPU usability until reboot.
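The steps above can be sketched as a monitor loop. This is a rough outline, not a tested harness: the endpoint is the stock Ollama default, and the model name is an assumption about the local setup:

```shell
# Hedged repro-monitor sketch: run inference requests until the kernel
# log shows the GPU dropping off the bus.
check_gpu_lost() {
  # Succeeds if the kernel log text on stdin contains Xid 79 or 154.
  grep -qE 'NVRM: Xid \(PCI:[^)]+\): (79|154)'
}
# Repro loop (commented out; "llama3" is a placeholder model name):
#   while :; do
#     curl -s http://localhost:11434/api/generate \
#       -d '{"model":"llama3","prompt":"hello"}' >/dev/null
#     dmesg | check_gpu_lost && { echo "GPU lost - stopping"; break; }
#   done
# Demonstration of the check against a sample log line:
echo 'NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.' \
  | check_gpu_lost && echo detected
```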
Expected
- A CUDA workload should either succeed or fail gracefully without losing PCIe/NVML access to the GPU.
Actual
- The driver loses the GPU from the bus (Xid 79), enters a non-recoverable state (Xid 154), and requires a reboot.
Current root-cause statement
- Immediate root cause (confirmed): the NVIDIA kernel driver enters a GPU-lost condition (Xid 79, bus loss) during Ollama CUDA activity, then marks the GPU unrecoverable (Xid 154, reboot required).
- Most likely underlying cause: bug/regression in the NVIDIA open-kernel hybrid runtime-power-management path (PCIe/power-state transition under CUDA load), potentially aggravated by resume/ACPI event issues.
- Note: user-space Ollama triggers the path, but the fatal condition is in the GPU driver/kernel stack.
What has already been tried
- Restarting Ollama service: does not recover GPU.
- Runtime reset attempts: no recovery.
- Reboot: consistently recovers the GPU (the only fix that works so far).
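For reference, the "runtime reset attempts" were of the standard PCI remove/rescan form, which did not recover the GPU in this case (consistent with the "Unknown header type 7f" state). A sketch with a dry-run guard, since the real commands need root and detach the device:

```shell
# Standard Linux PCI remove/rescan sequence (did NOT recover the GPU here).
# DRY_RUN=1 only prints the commands; set DRY_RUN=0 to actually run them.
DRY_RUN=1
gpu=0000:01:00.0
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}
run sh -c "echo 1 > /sys/bus/pci/devices/$gpu/remove"
run sh -c "echo 1 > /sys/bus/pci/rescan"
```

When the GPU has truly fallen off the bus (Xid 79), the rescan typically either fails to re-enumerate the device or brings it back in the same unmanageable state.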