Nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4044

Hello all,

We are experiencing periodic full crashes of this card during high-intensity training runs. The card has been in service for some time, so we know it is fully capable of supporting these workloads. Mid-run, we get a kernel alert that the GPU has fallen off the bus. We have captured two nvidia-bug-report dumps, along with journal and dmesg logging from the host node. We were previously running driver version 515.65.01 and downgraded back to the previously stable 470.141.03. Two other cards running similar training on neighboring nodes with the same OS baseline/software stack are running fine (though, admittedly, we run those cards at lower intensity than this node).
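
For anyone wanting to reproduce the triage, here is a minimal sketch (in Python, purely illustrative, assuming journalctl is available and that the keyword list below is a reasonable starting point rather than exhaustive) of how the relevant kernel messages can be pulled out of the journal to timestamp each crash:

#!/usr/bin/env python3
"""Sketch: list NVRM / Xid / nvidia-modeset kernel messages with timestamps.

Assumptions: journalctl is available and readable by this user; the keyword
list is illustrative only.
"""
import subprocess

KEYWORDS = ("NVRM: Xid", "fallen off the bus", "nvidia-modeset: ERROR")

def nvrm_events():
    # -k: kernel messages only, -o short-iso: ISO timestamps, --no-pager: plain output
    out = subprocess.run(
        ["journalctl", "-k", "-o", "short-iso", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if any(k in line for k in KEYWORDS)]

if __name__ == "__main__":
    for line in nvrm_events():
        print(line)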

We cannot detect any power fluctuations or temperature spikes prior to a crash, and these units are hooked up to a UPS, so there should be no external power issues either. The temperature does not appear to exceed any danger thresholds - it usually hovers around 80-82 °C with no upward trend at the end. Hot, but not too hot.
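
For reference, the kind of temperature/power polling behind those numbers can be captured with something like the sketch below (nvidia-smi on the PATH is assumed; the interval and output path are arbitrary placeholders, not our actual setup):

#!/usr/bin/env python3
"""Sketch: periodically poll nvidia-smi and append temperature/power samples to a CSV.

Assumptions: nvidia-smi is on PATH; LOG_PATH and INTERVAL_S are placeholders.
"""
import subprocess
import time

LOG_PATH = "gpu_health.csv"   # placeholder output path
INTERVAL_S = 30               # arbitrary polling interval in seconds

QUERY = "timestamp,temperature.gpu,power.draw,utilization.gpu,memory.used"

def sample():
    # One CSV line per GPU, e.g. "2022/10/02 11:12:13.000, 81, 347.10 W, 99 %, 22010 MiB"
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    with open(LOG_PATH, "a") as f:
        while True:
            f.write(sample() + "\n")
            f.flush()
            time.sleep(INTERVAL_S)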

Please find attached two separate bug bundles, each dated and containing the journal/dmesg output along with the nvidia-bug-report details. The earlier report was captured on the newer driver stack (515.65.01) and the newer one (today) on 470.141.03.

Note that the BIOS/firmware on the host node is up to date, and the software is as well. Both times before these log bundles were gathered, we fully removed the driver stack, CUDA, and all associated NVIDIA modules, re-installed them cleanly, then observed a crash and gathered logs. (We did not collect logs the first time, just did general local troubleshooting, so these reports represent known crashes 2 and 3.)

We would appreciate it if you could review the attached logs and let us know whether you can spot the error, or whether you feel this is indicative of a failing card. We are open to purchasing a new one if needed; we just want to be sure it isn't something we've missed with the drivers, the training job, or anything else.

An important detail: we usually run these trainings over multiple days, and the crashes typically take about 2-5 days to occur, so I can reasonably assume it's not an issue with the training job itself (we don't see core spikes or memory/CPU jumps either; the job's processing requirements are intensive but not highly variable in terms of the pressure of requests on the card).

We also find that when this issue occurs we can't reboot remotely because the card never unloads, which is particularly annoying when working remotely - it requires a physical shutdown.

I appreciate your time and assistance in diagnosing this card-loss issue; I'm primarily seeking confirmation that the card is failing, unless there's something else I've missed. I can also supply heat/power log data as needed.

nvidia_bugbundle.tar.gz (802.4 KB)


Currently running the following version/kernel:

Ubuntu 20.04.5 LTS
5.15.0-50-generic

I am returning from the future to let folks know that this was solved by rolling back to the previously used driver version, which is stable for this GPU: 470.182.03.

I have no idea why, but this card absolutely does not like to play with any of the 5xx driver builds without crashing. (It occurred again with 530.)

NVIDIA RTX 3090

Thank you for updating us, future @scotchman0.

Since you have it running successfully now, I'd rather not suggest any changes.

But in both debug logs you find this:

Oct 02 11:12:13 emboleye3 kernel: NVRM: API mismatch: the client has the version 515.65.01, but
                                  NVRM: this kernel module has the version 470.94.  Please
                                  NVRM: make sure that this kernel module and all NVIDIA driver
                                  NVRM: components have the same version.

Oct 10 11:00:56 emboleye3 kernel: NVRM: API mismatch: the client has the version 470.141.03, but
                                  NVRM: this kernel module has the version 515.65.01.  Please
                                  NVRM: make sure that this kernel module and all NVIDIA driver
                                  NVRM: components have the same version.

This indicates an installation issue. The install log also shows several symlinks that were not cleanly removed, so a complete manual purge and reinstall might change something.
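
If it helps for future runs, here is a minimal sketch (just an illustration, assuming /proc/driver/nvidia/version exists, i.e. the module is loaded, and nvidia-smi is on the PATH) of how you could check for that kind of mismatch before kicking off a multi-day job:

#!/usr/bin/env python3
"""Sketch: compare the loaded NVRM kernel module version with the userspace
driver version reported by nvidia-smi.

Assumptions: the nvidia kernel module is loaded (so /proc/driver/nvidia/version
exists) and nvidia-smi is on PATH.
"""
import re
import subprocess

def kernel_module_version():
    # First line looks like:
    # "NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.141.03  ..."
    with open("/proc/driver/nvidia/version") as f:
        first = f.readline()
    match = re.search(r"Kernel Module\s+(\S+)", first)
    return match.group(1) if match else None

def userspace_version():
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except subprocess.CalledProcessError as e:
        # A severe mismatch often makes nvidia-smi fail with an NVML
        # "Driver/library version mismatch" error, which is equally diagnostic.
        print("nvidia-smi failed:", e.stderr.strip())
        return None
    return out.splitlines()[0].strip() if out.strip() else None

if __name__ == "__main__":
    k, u = kernel_module_version(), userspace_version()
    if k and u and k != u:
        print(f"API mismatch: kernel module {k} vs userspace components {u}")
    else:
        print(f"Versions agree: {k}")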

But as I said, it seems you've already fixed that part by rolling back. Only if you need to update the driver should you consider a complete, clean re-install of the system.

I’ve found the same problem with the same GPU, but with the ‘535*’ driver. By the way, the card in question also heats up in idle mode, even though ‘nvidia-smi’ shows the temperature is normal.