The RTX 3090 on my Ubuntu 20.04 machine deadlocks and hangs indefinitely whenever I try to run any process on it. The process becomes unkillable, and a hard reset of the system is needed each time this happens. The exact same code works as intended on a different machine with a much less powerful NVIDIA GPU. I have been trying to debug this for several days, to no avail. Sometimes (but not always) I can get PyTorch code to run on the GPU successfully if I set CUDA_LAUNCH_BLOCKING=1.

I have noticed that several tools, such as the hwinfo --gfxcard command and the Ubuntu “Software & Updates” GUI, fail to recognize the GPU as an RTX 3090 and instead report only an unknown or default NVIDIA graphics card. nvidia-smi always recognizes it as an RTX 3090, though. After running sudo update-pciids, lspci is also able to recognize it as an RTX 3090, but hwinfo and Software & Updates still do not (though they do identify the correct PCI ID of 0x2204).

Anything else that I could investigate to find the problem would be a big help. I am very frustrated. I have tried drivers 460, 470, and 495, with no apparent difference in behavior.
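Since the tools disagree on the card's name but agree on its PCI ID, it can help to compare the raw IDs rather than the names. The following is a minimal sketch, not a definitive diagnostic: it assumes the usual `lspci -nn` line layout (a slot, a description, and a trailing [vendor:device] pair), and the helper names find_nvidia_devices and read_lspci are hypothetical. 10de is NVIDIA's PCI vendor ID; 2204 is the device ID mentioned above.

```python
import re
import subprocess

# Matches a "[vendor:device]" pair as printed by `lspci -nn`.
# The class code (e.g. "[0300]") has no colon, so it is not matched.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)

NVIDIA_VENDOR_ID = "10de"  # NVIDIA's PCI vendor ID


def find_nvidia_devices(lspci_output):
    """Return (slot, device_id, description) for each NVIDIA line.

    Hypothetical helper: assumes one device per line, as `lspci -nn` prints.
    """
    devices = []
    for line in lspci_output.splitlines():
        matches = PCI_ID_RE.findall(line)
        if not matches:
            continue
        # Use the last [vendor:device] pair on the line.
        vendor, device = matches[-1]
        if vendor.lower() != NVIDIA_VENDOR_ID:
            continue
        slot, _, description = line.partition(" ")
        devices.append((slot, device.lower(), description))
    return devices


def read_lspci():
    """Run `lspci -nn` and return its text output (requires pciutils)."""
    result = subprocess.run(
        ["lspci", "-nn"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling find_nvidia_devices(read_lspci()) on the affected machine should list the card with device ID 2204 whether or not the local PCI ID database knows its name, which separates a stale name database from a real detection problem.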
Have you tried the CUDA-specific drivers from NVIDIA's repository index at /compute/cuda/repos/ubuntu2004/x86_64?
(See e.g. https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation for more detail)
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
I figured out a lot more and posted a much more directed topic: “Is my RTX 3090 not receiving enough power?”