Ubuntu 20.04 + nvidia-driver-470 - card drops out after minimal use / delay after first CUDA command, but works with nvidia-driver-460

Operating system:
Ubuntu 20.04 LTS - kernel 5.11.0-37-generic (also happens on kernel 5.11.0-34-generic)

CUDA is functional on boot:
nvidia-smi -L
GPU 0: GeForce RTX 3060 (UUID: )

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   30C    P0    38W / 170W |      0MiB / 12051MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It works fine… at first. I can even dot matrices in CuPy (see the sketch below). However, after a few minutes (or some number of hits on the driver, I can't tell which), it breaks / drops out into a 'no devices found' state that doesn't respond to a reset.
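For reference, a minimal sketch of the kind of CuPy workload that runs fine right after boot (assumes a CuPy build matching CUDA 11.x is installed for the system Python; the matrix size is arbitrary):

python3 -c "
import cupy as cp
a = cp.random.rand(2048, 2048, dtype=cp.float32)  # arbitrary size
b = cp.random.rand(2048, 2048, dtype=cp.float32)
c = a @ b                                         # cuBLAS GEMM on the GPU
cp.cuda.Stream.null.synchronize()                 # force the kernel to actually run
print('dot ok, checksum:', float(c.sum()))
"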

In fact, installing any of the 470 drivers via the CUDA repository (deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /) results in a seemingly functional CUDA stack that crashes out after several minutes, completely unrelated to the code being run:

  • nvidia-driver-470
  • nvidia-driver-470-server
  • cuda-driver
  • cuda-driver-470
    etc. - the specific driver versions in these packages appear to be either 470.63.01 or 470.57.02, and both exhibit this; simply running watch nvidia-smi -L will crash it out after a few minutes (see the commands after this list).
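For concreteness, this is roughly how I'm confirming which driver version actually got installed and inducing the drop-out (standard dpkg/modinfo/watch usage, nothing exotic):

# What the packaging and the loaded kernel module report
dpkg -l | grep -E 'nvidia-driver|cuda-driver'
modinfo nvidia | grep '^version'
cat /proc/driver/nvidia/version

# Polling the card like this is enough to kill it after a few minutes
watch -n 1 nvidia-smi -L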

The system will load the driver and appear functional (it can even dot matrices in CuPy), but after several seconds or minutes of use the driver will unload the card and dump the following into dmesg:

[101527.059084] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
[101527.059107] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[101527.090044] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
[101527.090062] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

CUDA is then not functional until the system is rebooted.
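To be clear about what I mean by 'doesn't respond to reset', roughly the following (a sketch using standard nvidia-smi and module tooling); none of it brings the card back, only a reboot does:

sudo systemctl stop nvidia-persistenced
sudo nvidia-smi --gpu-reset -i 0       # fails once the card has dropped out
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia                   # reload just hits RmInitAdapter failed again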

This does not happen with the 460-series driver (460.91.03), but that only gets me functional CUDA up to 11.2. An identical stack (CUDA libs, packages, container runtime, etc.) is functioning perfectly with the 470 driver across a range of different cards (V100s, A100s, 1xxx series, 2xxx series - we use what we can get), but it consistently fails with RTX 3060 cards (3 cards so far).

The cards are still listed as compatible, so I expect them to work, even if not necessarily as fast as they are in theory. Is that incorrect?

Running out of things to check / try.

When the GPU fails, are there any status messages similar to this?

 NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.

How many GPUs are in this system altogether? What are the system specs: CPU(s), amount of system memory? What is the wattage of the system power supply?

750W power supply, Intel i5-11700K on a Z590 chipset, 64GB of RAM, just the one GPU (plus the built-in graphics on the mainboard). The systems are pretty much idle when this happens (load average 0/0/0).

Nothing matching that, as far as I can see, grepping through dmesg / syslog or the nvidia support log bundle - here's what I think is the relevant part while inducing a drop-out with watch nvidia-smi -L:

Oct 07 00:09:49 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 510
Oct 07 00:09:49 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.63.01  Tue Aug  3 20:44:16 UTC 2021
Oct 07 00:09:49 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.63.01  Tue Aug  3 20:30:55 UTC 2021
Oct 07 00:09:49  kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Oct 07 00:09:49  kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Oct 07 00:09:49  kernel: nvidia-uvm: Loaded the UVM driver, major device number 508.
Oct 07 00:14:02  nvidia-persistenced[902]: Verbose syslog connection opened
Oct 07 00:14:02  nvidia-persistenced[902]: Now running with user ID 132 and group ID 135
Oct 07 00:14:02  nvidia-persistenced[902]: Started (902)
Oct 07 00:14:02  nvidia-persistenced[902]: device 0000:01:00.0 - registered
Oct 07 00:14:02  nvidia-persistenced[902]: Local RPC services initialized
Oct 08 01:32:20  kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
Oct 08 01:32:20  kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

(re-running nvidia-smi results in the RmInitAdapter failed! message again)
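For reference, this is roughly how I'm searching for Xid / NVRM messages while reproducing (plain journalctl/dmesg greps plus the nvidia-bug-report.sh bundle; the patterns are just what I've been using):

# Follow kernel messages live while watch nvidia-smi -L runs in another terminal
sudo journalctl -k -f | grep -Ei 'nvrm|xid'

# Search after the fact, including the support bundle
sudo dmesg -T | grep -Ei 'nvrm|xid'
grep -Ri 'xid' /var/log/syslog*
sudo nvidia-bug-report.sh && zgrep -i xid nvidia-bug-report.log.gz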