NVIDIA open driver crash under Ollama CUDA workload (Xid 79 -> GPU fallen off bus -> reboot required)

Summary

  • Running Ollama with CUDA on a hybrid Intel + NVIDIA laptop causes the NVIDIA GPU to enter a fatal state.
  • After the failure, nvidia-smi reports no usable device until a full reboot.
  • The failure is reproducible under Ollama's CUDA workload and is not fixed by restarting Ollama alone.

Environment (sanitized)

  • OS: Ubuntu 24.04.4 LTS
  • Kernel: 6.17.0-14-generic
  • GPU: NVIDIA GeForce RTX 4060 Laptop GPU (AD107M)
  • PRIME mode: on-demand
  • Driver stack: nvidia-driver-570-open 570.211.01
  • Relevant module setting: NVreg_DynamicPowerManagement=0x02 (fine-grained runtime power management, enabled by the default runtime PM config)
  • Ollama client/server: 0.15.2

Observed behavior

  1. Ollama uses CUDA normally at first.
  2. During inference, Ollama logs CUDA launch failure.
  3. Kernel logs Xid 79 and “GPU has fallen off the bus”.
  4. Driver escalates to Xid 154 and reports reboot required.
  5. Afterwards:
    • nvidia-smi => “Unable to determine the device handle … Unknown Error / No devices were found”
    • Live reset attempts do not recover GPU
    • Only full reboot restores operation
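A sketch for catching these Xid events as they happen (assumes a systemd journal; the extraction pattern is demonstrated against a sample line matching the logs below, so the exact PIDs/timestamps are illustrative):

```shell
# Live-watch the kernel log for NVRM Xid events (run alongside the workload):
#   journalctl -k -f | grep --line-buffered 'NVRM: Xid'
#
# The same pattern, applied to a sample line matching this report:
sample='NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.'
echo "$sample" | grep -o 'Xid (PCI:[0-9a-f:.]*): [0-9]*'
# prints: Xid (PCI:0000:01:00): 79
```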

Key evidence

  • Ollama log:

    • “CUDA error: unspecified launch failure”
    • “ggml_cuda_init: failed to initialize CUDA: unknown error” (after crash)
  • Kernel log:

    • “NVRM: Xid (PCI:0000:01:00): 79 … GPU has fallen off the bus.”
    • “NVRM: Xid (PCI:0000:01:00): 154 … Node Reboot Required”
    • “NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.”
  • PCI state after failure:

    • lspci -vv -s 01:00.0 shows “Unknown header type 7f”
    • GPU remains present on bus ID but is not manageable by NVML/CUDA
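The "Unknown header type 7f" output is itself diagnostic: when a device has fallen off the bus, config-space reads return all-ones (0xff), and lspci masks off bit 7 (the multifunction flag) of the header-type byte before classifying it. A small sketch of that decoding:

```shell
# All-ones config-space read => header-type byte 0xff; lspci masks the
# multifunction bit (bit 7), leaving 0x7f, which it reports as
# "Unknown header type 7f".
header_byte=0xff
printf 'header type: %02x\n' $(( header_byte & 0x7f ))
# prints: header type: 7f
```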

Additional signal

  • Repeated ACPI warning around resume events:
    • “ACPI Error: No handler or method for GPE 6B, disabling event”
  • Hybrid + runtime power management path is active; failure may be in driver/firmware power-state handling under CUDA load.
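To see whether the runtime-PM path is actually armed for the GPU, the kernel's per-device power state can be read from sysfs. A sketch (the BDF 0000:01:00.0 matches the bus ID in this report; adjust for other systems):

```shell
# Inspect the GPU's runtime-PM state; "control: auto" means runtime PM is
# active for the device, and runtime_status toggles active/suspended.
for f in control runtime_status; do
  p="/sys/bus/pci/devices/0000:01:00.0/power/$f"
  if [ -r "$p" ]; then
    echo "$f: $(cat "$p")"
  else
    echo "$f: (not present on this system)"
  fi
done
```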

Reproduction (minimal)

  1. Boot system normally with nvidia-driver-570-open and PRIME on-demand.
  2. Start Ollama and run a CUDA-backed inference request.
  3. Repeat requests until failure occurs.
  4. Observe Xid 79 / Xid 154 in kernel log and loss of GPU usability until reboot.
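Step 3 can be driven in a loop against Ollama's HTTP API. A minimal sketch, assuming the default port 11434 and that some model is already pulled (the model name "llama3" here is illustrative, not the model from this report):

```shell
# Repeated generate requests to exercise the CUDA path until failure;
# substitute any pulled model for the illustrative "llama3".
for i in $(seq 1 50); do
  curl -sf http://localhost:11434/api/generate \
    -d '{"model": "llama3", "prompt": "one short sentence", "stream": false}' \
    > /dev/null 2>&1 || { echo "request $i failed - check kernel log for Xid"; break; }
done
```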

Expected

  • CUDA workloads should either succeed or fail gracefully, without losing PCIe/NVML access to the GPU.

Actual

  • Driver loses GPU from bus (Xid 79), enters non-recoverable state (Xid 154), requires reboot.

Current root-cause statement

  • Immediate root cause (confirmed): NVIDIA kernel driver enters GPU-lost condition (Xid 79, bus loss) during Ollama CUDA activity, then marks GPU unrecoverable (Xid 154, reboot required).
  • Most likely underlying cause: bug/regression in NVIDIA open-kernel hybrid-runtime-power-management path (PCIe/power-state transition + CUDA load), potentially aggravated by resume/ACPI event issues.
  • Note: user-space Ollama triggers the path, but the fatal condition is in the GPU driver/kernel stack.

What has already been tried

  • Restarting Ollama service: does not recover GPU.
  • Runtime reset attempts: no recovery.
  • Reboot: consistently recovers the GPU (the only fix that works so far).

Cannot reproduce with a different setup:

  • Arch Linux throughout, except for the kernel and headers packages
  • linux-cachyos 6.19.4-2
  • nvidia-vulkan from AUR, version 580.94.18-1
  • cuda 13.1.1-1
  • ollama-cuda 0.17.4-1
  • RTX 5070 Ti

I can run the lfm2 model with repeated queries without the GPU dropping off the bus.


Since it’s a laptop, you can probably exclude the PSU ;-) Check your fans and thermal paste state. Then check for firmware updates for your model.

Thanks for the feedback.

@kode54

That’s helpful context. Your setup differs in several important areas:

  • Driver branch: 580.94.18 vs my 570-open (570.211.01)

  • Kernel: 6.19 vs 6.17

  • Distro: Arch vs Ubuntu 24.04

  • GPU: RTX 5070 Ti vs RTX 4060 Laptop (AD107M)

Since you’re on the newer 580 branch and a newer kernel, I’m wondering if this might be related to the hybrid laptop runtime power management path rather than CUDA itself.

@morgwai666

I agree Xid 79 can sometimes indicate hardware instability. A few things here make it look more like a driver/power-state issue though:

  • The failure is consistently reproducible under CUDA inference via Ollama

  • It does not occur during normal desktop usage or non-CUDA workloads

  • The GPU consistently follows the Xid 79 -> Xid 154 escalation path

  • A reboot fully restores the system with no lingering instability

  • Thermals during inference are normal

Given that this is a PRIME on-demand hybrid setup with NVreg_DynamicPowerManagement=0x02 enabled, I’m starting to suspect something in the runtime power-state transition path under sustained CUDA load.

I plan to stick with the open-source driver branch and would prefer not to disable runtime power management, since that could increase heat on a laptop.
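That said, a temporary boot with runtime PM fully disabled would confirm or rule out this path. A sketch of such a test config (the filename is arbitrary; 0x00 turns the NVIDIA dynamic power management path off entirely, versus the 0x02 fine-grained default on this system):

```conf
# /etc/modprobe.d/nvidia-pm-test.conf - temporary diagnostic, not a fix.
# 0x00 disables NVIDIA dynamic power management entirely.
# Rebuild the initramfs and reboot afterwards:
#   sudo update-initramfs -u && sudo reboot
options nvidia NVreg_DynamicPowerManagement=0x00
```

If the Xid 79 crash stops reproducing with this in place, that would strongly implicate the runtime power-state transition path.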

If anyone from NVIDIA can confirm whether there were fixes in 580 related to Xid 79 or hybrid runtime PM behavior, that would help narrow this down.

PS: I previously tested 580-open and saw the same issue. I downgraded to 570-open and held the packages; still the same issue. I then upgraded back to 580-open, and it still reproduces.

I do have the latest available firmware update installed.

Thanks. Sadly, nobody has helped resolve this issue yet; I'm wondering whether any of the developers have seen this post.

Unfortunately, NVIDIA's Linux desktop team is criminally understaffed; they are unable to keep up with major bugs affecting multiple users (see the release feedback threads).

I've just contacted them directly to see if they can help me solve this issue. Thanks.