Isaac Sim 5.1 GPU Crashes on L40 in Virtualized Proxmox Environment

Hello NVIDIA Devs,

I am experiencing consistent crashes in Isaac Sim 5.1.0 running on Lubuntu 24.04 with an NVIDIA L40 GPU and driver 580.95.05. The environment is a Proxmox virtual machine with GPU passthrough.

Symptoms:

  • The crash occurs either a few seconds after stopping the simulation or while simply moving the camera around the robot in the scene.

  • The time until crash is variable, ranging from a few seconds up to ~2 minutes.

  • After a crash, if I try to restart Isaac Sim without rebooting the VM, it fails to start and shows: CUDA error 214: uncorrectable ECC error encountered

  • To restart Isaac Sim, I need to reboot the VM.

Additional details:

  • I have tried several NVIDIA driver versions (535, 550, 570, 580); the crash persists with all of them. The crash also happens on Isaac Sim 5.0.0.

  • The crash happens even with pre-made example scenes and simulations included in the Isaac Sim installer.

Attached logs:

  • Console output at Isaac Sim startup

  • Output at the moment of crash

  • Output when trying to restart Isaac Sim without rebooting the VM

isaac_restart.txt (9.8 KB)

console_before_crash.txt (68.9 KB)

console_at_crash.txt (32.9 KB)

Could you please advise whether this is a known issue, and whether it is related to GPU passthrough, to the GPU currently running at PCIe Gen1 instead of Gen4, or to something else? Is there a recommended workaround or fix?

Thank you for your help!

Isaac Sim Version

5.1.0

Operating System

Ubuntu 24.04

GPU Information

  • Model: L40
  • Driver Version: 580.95.05

Thank you for posting this. Here are a few notes that may help.

Probable Causes

  • GPU Passthrough Limitations: Isaac Sim is highly demanding on graphics resources and is normally run with bare-metal hardware acceleration. Passing an NVIDIA GPU through to a VM is possible, but problems are common with recent hardware and drivers, and an L40 linked at PCIe Gen1 rather than Gen4 is a further warning sign.
  • Uncorrectable ECC Error and CUDA Device Loss: After a crash, restart attempts fail with CUDA error 214 (cudaErrorECCUncorrectable), which reports an uncorrectable ECC error. This suggests that the GPU is not fully reset after the crash and stays unusable until the whole VM is rebooted.
  • Virtualization and PCIe Bandwidth: Running the L40 at PCIe Gen1 instead of Gen4 cuts the per-lane transfer rate from 16 GT/s to 2.5 GT/s. The reduced bandwidth may contribute to instability and device loss during intensive rendering and simulation tasks.
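Both conditions above can be checked from inside the guest before and after a crash. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH in the VM; the query field names are the ones listed by `nvidia-smi --help-query-gpu`, while the helper names (`parse_gpu_rows`, `summarize`, `query_gpus`) are made up for this example. Note that the current link generation can drop while the GPU is idle to save power, so the reading is most meaningful while Isaac Sim is rendering.

```python
import csv
import io
import subprocess

# Query fields as named by `nvidia-smi --help-query-gpu`.
FIELDS = [
    "name",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def parse_gpu_rows(csv_text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for values in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if values:
            rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows


def to_int(value):
    """Return an int, or None for non-numeric values such as "[N/A]"."""
    try:
        return int(value)
    except ValueError:
        return None


def summarize(row):
    """List findings worth attaching to a bug report for one GPU."""
    notes = []
    current = to_int(row["pcie.link.gen.current"])
    maximum = to_int(row["pcie.link.gen.max"])
    if current is not None and maximum is not None and current < maximum:
        notes.append(f"PCIe link at Gen{current}, below maximum Gen{maximum}")
    for key in FIELDS[4:]:
        count = to_int(row[key])
        if count:  # skips both None and 0
            notes.append(f"{key} = {count}")
    return notes


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_rows(result.stdout)

# Usage on the VM:
#   for gpu in query_gpus():
#       print(gpu["name"], summarize(gpu) or "no findings")
```

A non-zero uncorrected count would point toward a hardware or reset problem rather than Isaac Sim itself. A Gen1 reading taken under load would support the bandwidth concern.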

Recommendations

  • Ensure PCIe Gen4 Bandwidth: For virtualized environments, verify that the passed-through GPU negotiates the full PCIe Gen4 link speed. A link stuck at Gen1 has been associated with out-of-memory, device loss, and ECC errors.
  • Update Proxmox and VM Configuration: On the Proxmox host, confirm that IOMMU is enabled and that the L40 is bound to the vfio-pci driver for passthrough. In the guest VM, double-check the CPU and memory allocation.
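For the IOMMU part of the second recommendation, one thing worth checking on the Proxmox host is which devices share an IOMMU group with the GPU, since devices in the same group generally have to be passed through together. The sketch below reads the standard Linux sysfs directory `/sys/kernel/iommu_groups`. The function names and the PCI address in the usage comment are hypothetical; substitute the address that `lspci` reports for the L40.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # Directory is absent when the kernel does not expose IOMMU groups.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            groups[group_dir.name] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def group_of(pci_address, groups):
    """Return (group, other devices in that group), or None if not found."""
    for group, devices in groups.items():
        if pci_address in devices:
            return group, [d for d in devices if d != pci_address]
    return None

# Usage on the Proxmox host (the address below is a placeholder):
#   print(group_of("0000:01:00.0", iommu_groups()))
```

An empty result from `iommu_groups()` suggests that IOMMU is not active on the host. A GPU that shares its group with unrelated devices is a common source of passthrough trouble.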
