Hello all,
We are experiencing periodic full crashes of this card during high-intensity training runs. The card has been in service for some time, so we know it is fully capable of supporting these workloads. Mid-run, we get a kernel alert that the GPU has fallen off the bus. We have captured two nvidia-bug-report dumps, along with journal and dmesg logging from the host node. We were previously running driver version 515.65.01 and downgraded back to the previously stable 470.141.03. We have two other cards running similar training on neighboring nodes with the same OS baseline/software stack, and they are running fine (though, admittedly, we run those cards at lower intensity than this node).
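For reference, below is a minimal sketch of the check we run against the kernel log after a crash to pull out the NVRM/Xid lines (Xid 79 corresponds to the "GPU has fallen off the bus" event). The log-line format, PCI address, and pid in the comment are illustrative placeholders, and reading dmesg may require elevated privileges depending on the kernel.dmesg_restrict setting.

```python
#!/usr/bin/env python3
"""Pull NVIDIA Xid / "fallen off the bus" events out of the kernel log.

Minimal sketch only: assumes `dmesg` is readable by the current user and
that the driver logs errors via the usual NVRM lines.
"""
import re
import subprocess

# NVRM error lines look roughly like (address/pid are placeholders):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM:\s+Xid\s+\(PCI:[0-9a-fA-F:.]+\):\s+(\d+)")

def scan_kernel_log() -> None:
    out = subprocess.run(["dmesg"], capture_output=True, text=True,
                         check=True).stdout
    for line in out.splitlines():
        if XID_RE.search(line) or "fallen off the bus" in line.lower():
            print(line)

if __name__ == "__main__":
    scan_kernel_log()
```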
We cannot detect any power fluctuations or temperature spikes prior to a crash, and these units are on a UPS, so there should be no external power issues either. The temperature does not appear to exceed any danger thresholds - it usually hovers around 80-82 °C with no upward trend before the crash - hot, but not too hot.
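For context, the temperature/power figures above come from periodic nvidia-smi sampling along the lines of the sketch below; the 5-second interval, field list, and output filename are illustrative rather than our exact logger.

```python
#!/usr/bin/env python3
"""Periodically log GPU temperature and power draw via nvidia-smi.

Sketch of the sampling behind the 80-82 C figures quoted above; the
interval, field list, and output path are illustrative only.
"""
import subprocess
import time

FIELDS = "timestamp,index,temperature.gpu,power.draw,clocks.sm,utilization.gpu"

def sample() -> str:
    # nvidia-smi prints one CSV row per GPU for the requested fields.
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    with open("gpu_thermal_log.csv", "a") as log:
        while True:
            log.write(sample() + "\n")
            log.flush()
            time.sleep(5)  # sample interval (illustrative)
```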
Please find attached two separate bug bundles, each dated and containing the journal/dmesg output and the nvidia-bug-report details; the earlier report was captured on the newer driver stack (515.65.01), and the newer one (today) on 470.141.03.
Note that the BIOS/firmware on the host node is up to date, as is the software. Both times, before these log bundles were gathered, we fully removed the driver stack, CUDA, and all associated NVIDIA modules, re-installed them cleanly, then observed a crash and gathered logs. (We did not capture logs for the first crash - just general local troubleshooting - so these reports represent known crashes 2 and 3.)
We would appreciate it if you could review the attached logs and let us know whether you can spot the error, or whether you feel this is indicative of a failing card. We are open to purchasing a new one if needed; we just want to be sure it's not something we've missed with the drivers or the training workload itself.
An important detail is that these training runs span multiple days, and the crashes usually take about 2-5 days to occur, so I can reasonably assume it's not an issue with the training job itself (we don't see core spikes or other memory/CPU jumps either; the processing requirements of the job are intensive but steady, with little fluctuation or variability in the load placed on the card).
We also see that when this issue occurs we cannot reboot remotely because the driver never unloads the card, which is a particular annoyance when working remotely - it requires a physical power cycle.
We appreciate your time and assistance in diagnosing this card-loss issue; we are primarily seeking confirmation that the card is failing, unless there is something else we have missed. We can supply temperature/power log data as needed as well.
nvidia_bugbundle.tar.gz (802.4 KB)