Ubuntu 20.04 + nvidia-driver-470 - card drops out after minimal use / delay after first CUDA command, but works with nvidia-driver-460

Operating system:
Ubuntu 20.04 LTS - kernel 5.11.0-37-generic (also happens on kernel 5.11.0-34-generic)

CUDA is functional on boot:
nvidia-smi -L
GPU 0: GeForce RTX 3060 (UUID: )

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   30C    P0    38W / 170W |      0MiB / 12051MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It works fine… at first. I can even dot matrices in CuPy (see the sketch below). However, after a few minutes (or some number of hits on the driver, I can't tell which), it breaks / drops out into a 'no devices found' state that doesn't respond to a reset.
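For reference, a minimal sketch of the kind of CuPy workload that runs fine right after boot (assumes a CuPy build matching CUDA 11.x is installed for the system Python; the matrix size is arbitrary):

python3 -c "
import cupy as cp
a = cp.random.rand(2048, 2048, dtype=cp.float32)  # arbitrary size
b = cp.random.rand(2048, 2048, dtype=cp.float32)
c = a @ b                                         # cuBLAS GEMM on the GPU
cp.cuda.Stream.null.synchronize()                 # force the kernel to actually run
print('dot ok, checksum:', float(c.sum()))
"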

In fact, installing any of the 470 drivers via the CUDA repository (deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /) results in a seemingly functional CUDA stack that crashes out after several minutes, completely unrelated to the code being run:

  • nvidia-driver-470
  • nvidia-driver-470-server
  • cuda-driver
  • cuda-driver-470
    etc. - the specific driver versions in these packages appear to be either 470.63.01 or 470.57.02, and both exhibit this; simply running watch nvidia-smi -L will crash it out after a few minutes (see the commands after this list).
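For concreteness, this is roughly how I'm confirming which driver version actually got installed and inducing the drop-out (standard dpkg/modinfo/watch usage, nothing exotic):

# What the packaging and the loaded kernel module report
dpkg -l | grep -E 'nvidia-driver|cuda-driver'
modinfo nvidia | grep '^version'
cat /proc/driver/nvidia/version

# Polling the card like this is enough to kill it after a few minutes
watch -n 1 nvidia-smi -L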

The system will load the driver and appear functional (it can even dot matrices in CuPy), but after several seconds or minutes of use the driver will unload the card and dump the following into dmesg:

[101527.059084] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
[101527.059107] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[101527.090044] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
[101527.090062] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

CUDA is then not functional until the system is rebooted.
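To be clear about what I mean by 'doesn't respond to reset', roughly the following (a sketch using standard nvidia-smi and module tooling); none of it brings the card back, only a reboot does:

sudo systemctl stop nvidia-persistenced
sudo nvidia-smi --gpu-reset -i 0       # fails once the card has dropped out
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia                   # reload just hits RmInitAdapter failed again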

This does not happen with the 460-series driver (460.91.03), but that only gets me functional CUDA up to 11.2. An identical stack (CUDA libs, packages, container runtime, etc.) is functioning perfectly with the 470 driver across a range of different cards (V100s, A100s, 1xxx series, 2xxx series - we use what we can get), but it consistently fails with RTX 3060 cards (3 cards so far).

The cards are still listed as compatible, so I expect them to work, even if not necessarily as fast as they are in theory. Is that incorrect?

Running out of things to check / try.

When the GPU fails, are there any status messages similar to this?

 NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.

How many GPUs are in this system altogether? What are the system specs: CPU(s), amount of system memory? What is the wattage of the system power supply?

750W power supply, Intel i5-11700K on a Z590 chipset, 64GB of RAM, just the one GPU (plus the built-in graphics on the mainboard). The systems are pretty much idle when this happens (load average 0/0/0).

Nothing matching that, as far as I can see, grepping through dmesg / syslog or the nvidia support log bundle - here's what I think is the relevant part while inducing a drop-out with watch nvidia-smi -L:

Oct 07 00:09:49 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 510
Oct 07 00:09:49 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.63.01  Tue Aug  3 20:44:16 UTC 2021
Oct 07 00:09:49 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.63.01  Tue Aug  3 20:30:55 UTC 2021
Oct 07 00:09:49  kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Oct 07 00:09:49  kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Oct 07 00:09:49  kernel: nvidia-uvm: Loaded the UVM driver, major device number 508.
Oct 07 00:14:02  nvidia-persistenced[902]: Verbose syslog connection opened
Oct 07 00:14:02  nvidia-persistenced[902]: Now running with user ID 132 and group ID 135
Oct 07 00:14:02  nvidia-persistenced[902]: Started (902)
Oct 07 00:14:02  nvidia-persistenced[902]: device 0000:01:00.0 - registered
Oct 07 00:14:02  nvidia-persistenced[902]: Local RPC services initialized
Oct 08 01:32:20  kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
Oct 08 01:32:20  kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

(re-running nvidia-smi results in the RmInitAdapter failed! message again)
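For reference, this is roughly how I'm searching for Xid / NVRM messages while reproducing (plain journalctl/dmesg greps plus the nvidia-bug-report.sh bundle; the patterns are just what I've been using):

# Follow kernel messages live while watch nvidia-smi -L runs in another terminal
sudo journalctl -k -f | grep -Ei 'nvrm|xid'

# Search after the fact, including the support bundle
sudo dmesg -T | grep -Ei 'nvrm|xid'
grep -Ri 'xid' /var/log/syslog*
sudo nvidia-bug-report.sh && zgrep -i xid nvidia-bug-report.log.gz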