RTX 3090 consistently hangs and processes become unkillable

The RTX 3090 in my Ubuntu 20.04 machine deadlocks and hangs indefinitely whenever I try to run any process on it. The process becomes unkillable, and a hard reset of the system is needed each time this happens. The exact same code works as intended on a different machine with a much less powerful NVIDIA GPU. I have been trying to debug this for several days, to no avail. Sometimes (but not always) I can get PyTorch code to run on the GPU successfully if I set CUDA_LAUNCH_BLOCKING=1.

I have noticed that several tools, such as the hwinfo --gfxcard command and the Ubuntu “Software & Updates” GUI, fail to recognize the GPU as an RTX 3090 and instead report only an unknown or default NVIDIA graphics card. nvidia-smi always recognizes it as an RTX 3090, though. After running sudo update-pciids, lspci is also able to recognize it as an RTX 3090, but hwinfo and Software & Updates still do not (though they do identify the correct PCI ID of 0x2204).

I have tried drivers 460, 470, and 495, with no apparent difference in behavior. Anything else that I could investigate to find the problem would be a big help. I am very frustrated.
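A minimal reproduction with CUDA_LAUNCH_BLOCKING=1 might look like the sketch below. It assumes PyTorch is installed. The name cuda_smoke_test and the matrix size are hypothetical and are not the code from the original report. The variable is set before torch is imported to be safe, since it has to be in place before CUDA is initialized.

```python
import os

# Set before importing torch so it is in place when CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_smoke_test(size=1024):
    """Run one small matrix multiply on the GPU and return a status string.

    Hypothetical helper for illustration only.
    """
    try:
        import torch
    except ImportError:
        return "torch is not installed"

    if not torch.cuda.is_available():
        return "torch cannot see a CUDA device"

    device = torch.device("cuda:0")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    c = a @ b
    # Wait for the kernel to finish so that a hang shows up at this line.
    torch.cuda.synchronize()
    return f"ok on {torch.cuda.get_device_name(0)}, checksum {c.sum().item():.3f}"


if __name__ == "__main__":
    # flush=True so the last printed line survives if the process then hangs.
    print("starting CUDA smoke test", flush=True)
    print(cuda_smoke_test(), flush=True)
```

If the hang is going to occur, this script will most likely stop after the first print and never reach the second. Because a hung process cannot be killed, it is probably best to run it from a terminal you can afford to lose.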

Have you tried the CUDA-specific drivers from NVIDIA's CUDA repository for Ubuntu 20.04 (/compute/cuda/repos/ubuntu2004/x86_64)?

(See e.g. https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation for more detail)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
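Alongside the bug report, it may help to record how each tool identifies the card, since the tools reportedly disagree. The following is a rough sketch only. The helper names run_tool and collect_gpu_ids are hypothetical. The nvidia-smi query fields and the lspci vendor filter (10de is NVIDIA's PCI vendor ID) are choices made for this example, and hwinfo may need root to show full details.

```python
import shutil
import subprocess

# hwinfo --gfxcard comes from the original post; the other flags are
# assumptions chosen for this sketch and can be adjusted.
COMMANDS = {
    "nvidia-smi": ["nvidia-smi", "--query-gpu=name,driver_version",
                   "--format=csv,noheader"],
    "lspci": ["lspci", "-nn", "-d", "10de:"],  # 10de: NVIDIA vendor ID
    "hwinfo": ["hwinfo", "--gfxcard"],
}


def run_tool(cmd, timeout=30):
    """Return the tool's stdout, or None if it is missing, fails, or times out."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_gpu_ids(commands=COMMANDS):
    """Run each command and map its label to the captured output (or None)."""
    return {name: run_tool(cmd) for name, cmd in commands.items()}


if __name__ == "__main__":
    for name, output in collect_gpu_ids().items():
        print(f"== {name} ==")
        print(output if output is not None else "(unavailable)")
```

It is probably safest to run this right after a fresh boot, before anything has touched the GPU. If the driver is already wedged, nvidia-smi itself may block and the timeout may not be able to rescue it.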

I figured out a lot more and posted a much more focused topic: Is my RTX 3090 not receiving enough power?