Hello all,
We are experiencing periodic full crashes of this card during high-intensity training runs. The card has been in service for some time, so we know it is fully capable of supporting these workloads. Mid-run, we get a kernel alert that the GPU has fallen off the bus. We have captured two nvidia-bug-report dumps, along with journal and dmesg logging from the host node. We were previously running driver version 515.65.01 and downgraded back to the previously stable 470.141.03. We have two other cards running similar training on neighboring nodes with the same OS baseline/software stack, and they are running fine (though, admittedly, we run those cards at lower intensity than this node).
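For reference, below is a minimal sketch of the check we run against the kernel log after a crash to pull out the NVRM/Xid lines (Xid 79 corresponds to the "GPU has fallen off the bus" event). The log-line format, PCI address, and pid in the comment are illustrative placeholders, and reading dmesg may require elevated privileges depending on the kernel.dmesg_restrict setting.

```python
#!/usr/bin/env python3
"""Pull NVIDIA Xid / "fallen off the bus" events out of the kernel log.

Minimal sketch only: assumes `dmesg` is readable by the current user and
that the driver logs errors via the usual NVRM lines.
"""
import re
import subprocess

# NVRM error lines look roughly like (address/pid are placeholders):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM:\s+Xid\s+\(PCI:[0-9a-fA-F:.]+\):\s+(\d+)")

def scan_kernel_log() -> None:
    out = subprocess.run(["dmesg"], capture_output=True, text=True,
                         check=True).stdout
    for line in out.splitlines():
        if XID_RE.search(line) or "fallen off the bus" in line.lower():
            print(line)

if __name__ == "__main__":
    scan_kernel_log()
```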
We cannot detect any power fluctuations or temperature spikes prior to a crash, and these units are on a UPS, so there should be no external power issues either. The temperature does not appear to exceed any danger thresholds - it usually hovers around 80-82 °C with no upward trend before the crash - hot, but not too hot.
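For context, the temperature/power figures above come from periodic nvidia-smi sampling along the lines of the sketch below; the 5-second interval, field list, and output filename are illustrative rather than our exact logger.

```python
#!/usr/bin/env python3
"""Periodically log GPU temperature and power draw via nvidia-smi.

Sketch of the sampling behind the 80-82 C figures quoted above; the
interval, field list, and output path are illustrative only.
"""
import subprocess
import time

FIELDS = "timestamp,index,temperature.gpu,power.draw,clocks.sm,utilization.gpu"

def sample() -> str:
    # nvidia-smi prints one CSV row per GPU for the requested fields.
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    with open("gpu_thermal_log.csv", "a") as log:
        while True:
            log.write(sample() + "\n")
            log.flush()
            time.sleep(5)  # sample interval (illustrative)
```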
Please find attached two separate bug bundles, each dated and containing the journal/dmesg output and the nvidia-bug-report details; the earlier report was captured on the newer driver stack (515.65.01), and the newer one (today) on 470.141.03.
Note that the BIOS/firmware on the host node is up to date, as is the software. Both times, before these log bundles were gathered, we fully removed the driver stack, CUDA, and all associated NVIDIA modules, re-installed them cleanly, then observed a crash and gathered logs. (We did not capture logs for the first crash - just general local troubleshooting - so these reports represent known crashes 2 and 3.)
We would appreciate it if you could review the attached logs and let us know whether you can spot the error, or whether you feel this is indicative of a failing card. We are open to purchasing a new one if needed; we just want to be sure it's not something we've missed with the drivers or the training workload itself.
An important detail is that these training runs span multiple days, and the crashes usually take about 2-5 days to occur, so I can reasonably assume it's not an issue with the training job itself (we don't see core spikes or other memory/CPU jumps either; the processing requirements of the job are intensive but steady, with little fluctuation or variability in the load placed on the card).
We also see that when this issue occurs we cannot reboot remotely because the driver never unloads the card, which is a particular annoyance when working remotely - it requires a physical power cycle.
We appreciate your time and assistance in diagnosing this card-loss issue; we are primarily seeking confirmation that the card is failing, unless there is something else we have missed. We can supply temperature/power log data as needed as well.
nvidia_bugbundle.tar.gz (802.4 KB)