CUDA hangs during cuInit

The first time a process interacts with CUDA, it hangs for tens of seconds. For instance, cupy.cuda.runtime.getDeviceCount() takes over 60 seconds the first time it is called, but subsequent calls within the same process are fast. A basic hello-world CUDA example has the same symptom. Using Nsight Compute, I determined that cuInit is the culprit: it takes 30-60 seconds, although it returns a success code. The only exception is nvidia-smi, which shows status immediately.
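For anyone who wants to reproduce the measurement without a profiler, here is a minimal sketch that times the first and second call of a function within one process. The helper names time_calls and time_cuinit are hypothetical, and the library name libcuda.so.1 is an assumption about a typical Linux driver install; cuInit(0) is the CUDA driver API call itself, invoked through ctypes.

```python
import ctypes
import time


def time_calls(fn, repeats=2):
    """Call fn() `repeats` times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def time_cuinit(library_name="libcuda.so.1"):
    """Time cuInit(0) twice in this process.

    Assumption: the driver library is loadable under `library_name`.
    Loading the library happens outside the timed region.
    """
    libcuda = ctypes.CDLL(library_name)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int  # CUresult; 0 means CUDA_SUCCESS

    result_codes = []
    durations = time_calls(lambda: result_codes.append(libcuda.cuInit(0)))
    return durations, result_codes


def report():
    """Print the timings; run this on the affected machine."""
    durations, result_codes = time_cuinit()
    for index, (seconds, code) in enumerate(zip(durations, result_codes), start=1):
        print(f"cuInit call {index}: {seconds:.2f} s, result code {code}")
```

Calling report() in a fresh process should show whether the delay is confined to the first cuInit call. To time the CuPy path instead, pass cupy.cuda.runtime.getDeviceCount to time_calls.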

This is not a headless node. It is running Ubuntu 20.04. I have tried setting persistence mode both off and on (nvidia-smi -pm 0 and nvidia-smi -pm 1), with no effect.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:19:00.0 Off |                  Off |
| 30%   29C    P8    16W / 230W |     10MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:1A:00.0 Off |                  Off |
| 30%   35C    P8    16W / 230W |     10MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    On   | 00000000:67:00.0 Off |                  Off |
| 30%   39C    P8    18W / 230W |     10MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:68:00.0 Off |                  Off |
| 30%   40C    P8    21W / 230W |    214MiB / 24253MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1808      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     11327      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1808      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A     11327      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1808      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A     11327      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1808      G   /usr/lib/xorg/Xorg                 57MiB |
|    3   N/A  N/A     11327      G   /usr/lib/xorg/Xorg                121MiB |
|    3   N/A  N/A     11481      G   /usr/bin/gnome-shell               24MiB |
+-----------------------------------------------------------------------------+

Check out Robert’s reply here: