GPU errors during CUDA-based computations

We have a GPU server with four NVIDIA GeForce RTX 3090 24 GB cards that is used for machine learning based on PyTorch. One of the four GPUs shows the following behavior: loading a tensor onto the GPU works fine, but when a computation is started, execution stops after a short time with varying error messages. It is striking that even after the computation has stopped, the GPU utilization remains at a high level (e.g. 90 %) with no memory usage.

[Screenshot attachment: gpu_error]
When swapping the GPU slots, e.g. moving the faulty GPU from slot 0 to slot 1, the error now occurs on the GPU in slot 1. Furthermore, the failing computations involve backpropagation. Another striking detail is that the GPUs on which everything seems to work fine have VBIOS version 94.02.42.00.B0, while the failing card has VBIOS version 94.02.42.80.65.
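For illustration, a stripped-down training loop of the kind that fails looks roughly like the sketch below (the model, tensor sizes, and step count are placeholders, not the actual user code):

```python
import torch

# Minimal per-card stress test: load tensors, run forward/backward repeatedly.
# Model, sizes, and iteration count are arbitrary placeholders.
def stress_gpu(device_index: int, iterations: int = 1000) -> None:
    device = torch.device(f"cuda:{device_index}")
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(iterations):
        x = torch.randn(256, 4096, device=device)   # loading tensors works fine
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()                              # backpropagation step
        optimizer.step()
        if step % 100 == 0:
            torch.cuda.synchronize(device)           # surface asynchronous CUDA errors
            print(f"cuda:{device_index} step {step} loss {loss.item():.4f}")

if __name__ == "__main__":
    for i in range(torch.cuda.device_count()):
        stress_gpu(i)
```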

The most frequent error message was RuntimeError: CUDA error: an illegal memory access was encountered. Further error messages were RuntimeError: CUDA error: the launch timed out and was terminated and RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR.

System: Ubuntu 20.04 LTS

Any suggestions? Thanks!

An illegal memory access would normally be something that your code (or PyTorch code) is doing that is illegal. That is not necessarily a problem that can be solved without specific debugging. Likewise, a launch timeout means a kernel is running for too long.
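As a side note, asynchronous error reporting can make the Python stack trace misleading; forcing synchronous launches narrows down which operation actually faults. A minimal sketch:

```python
# Force synchronous CUDA kernel launches so that the "illegal memory access"
# error is reported at the operation that actually triggered it, rather than
# at some later, unrelated call. Must be set before CUDA is initialized.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# ... run the failing workload here and inspect the resulting stack trace.
```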

You may wish to check whether your GPUs have a kernel runtime limit, using e.g. deviceQuery. If I had a setup like this, running on Linux, I would be sure to disable X or any graphical desktop that is using any of these GPUs.
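If deviceQuery is not readily available, the same attribute can also be read programmatically; a minimal sketch using pycuda (assuming pycuda is installed):

```python
import pycuda.driver as cuda

# Report whether a kernel runtime limit (display watchdog) is active per GPU.
cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    timeout = dev.get_attribute(cuda.device_attribute.KERNEL_EXEC_TIMEOUT)
    status = "enabled" if timeout else "disabled"
    print(f"GPU {i} ({dev.name()}): kernel runtime limit {status}")
```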

That (cyclical exchange of GPUs) is a good technique to identify whether the problem is with the slot or the card. So in this case, you found that the problem follows the GPU.

In over a decade of using GPUs of all kinds, I have never encountered a problem related to a VBIOS issue, and in particular a minor VBIOS version difference. That doesn't mean such an issue could not exist, but it seems unlikely based on historical precedent.

We have, however, had quite a few reports in these forums where people aggressively stuff high-end GPUs into a single machine and experience power supply and cooling issues as a consequence. I assume your system has a >= 2400W power supply.

AI / ML software appears to do a great job of highlighting such issues. We have also had one recent case of a GPU behaving weirdly that ultimately turned out to be an intermittent defect in the power connector on the GPU (possible scenarios include a cold solder joint or a hairline fracture due to mechanical stress).

So if, after following up on Robert_Crovella's recommendation, you still experience issues with this GPU, check the system logs for error messages pertaining to the GPUs and check power supply and thermals for the GPUs.
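For example, GPU-related kernel log entries (Xid messages) and per-GPU thermal/power readings could be collected with something along these lines (a sketch; reading dmesg typically requires root):

```python
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# The NVIDIA driver reports hardware/driver faults as "Xid" messages in the kernel log.
xid_lines = [l for l in run(["dmesg"]).splitlines() if "Xid" in l or "NVRM" in l]
print("\n".join(xid_lines) or "no Xid/NVRM messages found")

# Current temperature, power draw, and SM clock per GPU.
print(run(["nvidia-smi",
           "--query-gpu=index,temperature.gpu,power.draw,clocks.sm",
           "--format=csv"]))
```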

Hi Robert, thanks for your answer. The GPUs are used by several scientists who run different code, so we think it is unlikely that the problem is code-related.

I will check for the kernel runtime limit.
The system is running with an X session due to the users' requirements. I will see how it behaves if I disable X.

The purchase of a system with consumer GPUs was only an interim solution due to the scarcity of hardware some time ago. By now, we also have two systems with passively cooled A40 cards.

The system with the 3090s does not really show elevated temperatures or power issues. It has a redundant power supply.

That by itself doesn’t mean much. With four high-end GPUs in the system (350W TDP each) I would assume the system sports at least one high-end CPU and copious amounts of system memory, say 256 GB. Assuming a TDP of the host system of 200W, you are looking at 1600W TDP. But TDP (thermal design power) is intended to characterize the power draw with respect to thermals, which means averaged over multiple minutes.

TDP does not say anything about instantaneous power (on the scale of milliseconds), which is what the PSU needs to deliver to keep the system running stably. In both modern CPUs and modern GPUs, short-term power spikes up to 40% above TDP have been observed. To build a system that is rock solid over a useful lifetime of 4 years, and considering that electronic components physically age, a good rule of thumb is to have the total TDP of all system components not exceed 60% of nominal PSU output by much. So for your machine that is probably in the 2400W to 2700W range.
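Applying that rule of thumb to the numbers assumed above gives roughly the following back-of-the-envelope estimate:

```python
# PSU sizing estimate: total component TDP should not exceed ~60% of nominal output.
gpu_tdp_w  = 4 * 350                      # four RTX 3090s at 350 W TDP each
host_tdp_w = 200                          # assumed CPU, memory, drives
total_tdp_w = gpu_tdp_w + host_tdp_w      # 1600 W
recommended_psu_w = total_tdp_w / 0.6     # about 2667 W nominal output
print(f"total TDP: {total_tdp_w} W, recommended PSU: ~{recommended_psu_w:.0f} W")
```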

Brown-outs caused by PSUs not keeping up with instantaneous power draw have caused all kinds of weird malfunctions, up to spontaneous reboots, characteristically occurring several minutes into running a heavy AI/ML workload. That is based on observation of numerous reports in these forums.

Also based on observing reports in these forums: a contributing factor is sometimes that people install more GPUs in a system than their PSU has dedicated PCIe auxiliary power connectors for, then attempt to work around this with converters, splitters, or daisy-chaining. That is an excellent recipe for an unstable system with weird problems. Not recommended.

Given that your problem seems to follow the GPU, a hardware defect in that GPU is also possible. I do not have statistics at hand (only NVIDIA or large distributors would have those based on RMA volume) but my impression is that hardware defects in modern GPUs are rare.

deviceQuery shows no kernel runtime limit. Will check with X disabled next, and will also look into the power connectors / power supply etc.

Thx again.