After spending a while trying to figure out why my GPU was hanging and creating unkillable processes, I now think it is likely not a software issue at all but a case of insufficient power. Most CUDA Samples run and complete successfully, but a handful trigger the same failure: whatever process was running hangs indefinitely and becomes unkillable, and most fields in a looping nvidia-smi call switch to “GPU is lost”. The CPU keeps working fine, though. Getting the GPU detected and usable again requires a hard reset of the machine.
After some snooping, I found that the issue arises whenever the GPU’s power draw exceeds ~130W. After setting the power limit to 100W with nvidia-smi -pl 100, the same CUDA Samples and other programs that hung before now run to completion. Lowering the clock speed has the same effect, as long as it is low enough that the power draw stays below that same ~130W mark.
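For anyone who wants to check for a similar threshold, here is a minimal sketch of how the power draw could be logged while a workload runs. It assumes nvidia-smi is on the PATH and a single GPU; the function names, the log path, and the 0.5 s polling interval are made up for illustration. Each sample is flushed to disk so the last reading before a hang should survive the hard reset, although nvidia-smi itself may stop responding once the GPU is lost.

```python
import os
import subprocess
import time

# Standard nvidia-smi query options; the three fields come back as one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV row into (timestamp, draw_w, limit_w).

    Returns None when a field is not numeric, e.g. after the GPU is lost.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None
    try:
        return parts[0], float(parts[1]), float(parts[2])
    except ValueError:
        return None


def log_power(path="gpu_power.log", interval_s=0.5):
    """Append one power sample per interval to a log file (hypothetical path)."""
    with open(path, "a") as log:
        while True:
            try:
                out = subprocess.run(
                    QUERY, capture_output=True, text=True, timeout=5
                ).stdout
            except subprocess.TimeoutExpired:
                out = ""  # nvidia-smi did not answer in time
            lines = out.splitlines()
            row = lines[0] if lines else ""  # assumes a single GPU
            sample = parse_sample(row)
            if sample is None:
                log.write(f"unreadable: {row!r}\n")
            else:
                log.write(f"{sample[0]} draw={sample[1]:.1f}W limit={sample[2]:.1f}W\n")
            # Force the sample to disk so it survives a hard reset.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling log_power() in a second terminal while one of the failing CUDA Samples runs should leave the last readable draw value at the end of the log. A 0.5 s poll will likely miss very short spikes, so the logged value is only a rough lower bound on the peak.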
I learned that NVIDIA recommends a PSU of at least 750W for an RTX 3090, as its default power draw limit is 350W (though in practice it seems it can go a bit higher). The PSU I have is only 600W. Before I buy a new, more powerful PSU, does this behavior/diagnosis make sense? Isn’t 130W a lower threshold than you’d expect, even with the slightly underpowered PSU? Why doesn’t it seem to affect the CPU? I didn’t build this PC myself either, so I want to make sure I’m doing the right thing before I change it.
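To make the question concrete, here is a rough sketch of the arithmetic, using only the figures quoted above. It treats the PSU rating as a single nominal budget, which is an assumption: it ignores short power spikes and how the power is split across cables, neither of which I have measured.

```python
PSU_W = 600              # PSU currently installed
RECOMMENDED_PSU_W = 750  # NVIDIA's recommended minimum for an RTX 3090
GPU_LIMIT_W = 350        # default GPU power limit
OBSERVED_HANG_W = 130    # approximate draw at which the GPU gets lost


def headroom(psu_w, gpu_w):
    """Nominal watts left for the rest of the system at a given GPU draw."""
    return psu_w - gpu_w


print(headroom(RECOMMENDED_PSU_W, GPU_LIMIT_W))  # 400
print(headroom(PSU_W, GPU_LIMIT_W))              # 250
print(headroom(PSU_W, OBSERVED_HANG_W))          # 470
```

On paper, the 600W unit still has 470W to spare at the point where the GPU drops out, which is why the ~130W threshold looks lower than a simple budget would predict.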
Thanks in advance for any help.