Operating system:
Ubuntu 20.04 LTS - kernel 5.11.0-37-generic (also happens on kernel 5.11.0-34-generic)
CUDA is functional on boot:
nvidia-smi -L
GPU 0: GeForce RTX 3060 (UUID: )
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   30C    P0    38W / 170W |      0MiB / 12051MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
It works fine… at first. I can even dot matrices in CuPy (a minimal version of that check is sketched below). However, after a few minutes (or after some number of hits on the driver, I can't tell which), it breaks: the GPU drops out and becomes a 'No devices were found' that doesn't respond to a reset.
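For what it's worth, the check that initially succeeds is nothing more elaborate than a matrix product in CuPy, along these lines (a minimal sketch; the array sizes are arbitrary):

import cupy as cp

# Allocate two matrices on the GPU and multiply them (sizes are arbitrary).
a = cp.random.rand(1024, 1024, dtype=cp.float32)
b = cp.random.rand(1024, 1024, dtype=cp.float32)
c = a @ b                          # matrix product on the GPU
cp.cuda.Device(0).synchronize()    # force the kernel to actually run
print(float(c.sum()))              # succeeds right after boot; fails once the driver drops out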
In fact, installing any of the 470 drivers from the CUDA repository (deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /) results in a seemingly functional CUDA that crashes out after several minutes, completely unrelated to what code is running:
- nvidia-driver-470
- nvidia-driver-470-server
- cuda-driver
- cuda-driver-470
etc. The specific driver versions in these packages appear to be either 470.63.01 or 470.57.02, and both exhibit this. Simply running watch nvidia-smi -L (roughly the polling loop sketched below) will crash it out after a few minutes.
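To be explicit about how little it takes, the polling that triggers it is essentially this (a rough Python equivalent of watch nvidia-smi -L; the 2-second sleep just mirrors watch's default interval):

import subprocess
import time

# Poll the driver the same way `watch nvidia-smi -L` does.
# On the RTX 3060 machines this alone kills the driver within minutes.
while True:
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())
    if result.returncode != 0:     # becomes "No devices were found" once the driver drops
        break
    time.sleep(2)                  # watch's default interval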
The system loads the driver and appears functional (I can even dot matrices in CuPy), but seconds to minutes later, after some use, the driver unloads and dumps the following into dmesg:
[101527.059084] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
[101527.059107] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[101527.090044] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
[101527.090062] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
CUDA is then not functional until the system is rebooted.
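For anyone trying to reproduce this, the failure is easy to detect from the kernel log; something along these lines would flag it (a sketch, not what I actually run; it assumes dmesg is readable by the current user, otherwise run it with sudo):

import subprocess

# Check the kernel log for the RmInitAdapter failure signature shown above.
def gpu_has_dropped() -> bool:
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return "RmInitAdapter failed" in log

print(gpu_has_dropped())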
This does not happen with the 460-series driver (460.91.03), but that only gets me functional CUDA up to 11.2. An identical stack (CUDA libraries, packages, container runtime, etc.) is working perfectly with the 470 driver across a range of other cards (V100s, A100s, 10xx series, 20xx series; we use what we can get), but it consistently fails with RTX 3060 cards (three cards so far).
These cards are still listed as compatible, so I expect them to work, even if not necessarily at their theoretical speed. Is that assumption incorrect?
I'm running out of things to check or try.