Power9 CUDA 10.2, Unable to set persistence mode for GPU

Hello all,

We are having an issue running CUDA code on a newly installed Power9 machine. We are running on bare metal (not a container or VM) on an AC922 with CentOS 7.8.2003, kernel 4.18.0-147.0.3.el7.ppc64le. I have compiled some of the CUDA sample programs, and they all fail with:

code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&device_count)"

nvidia-smi runs correctly and shows our 4 V100 GPUs, but reports persistence mode as "Off" for all of them. nvidia-persistenced is running via systemd.

As suggested in the systemctl output below, I have checked for memory auto-online settings and found that auto-onlining is not turned on in the kernel config or by a udev rule. The "failed to online memory" error from nvidia-persistenced seems like the culprit, but I can find no obvious reason why the device memory is already online.
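For anyone who wants to repeat the check, here is roughly what it amounts to, written as a minimal Python sketch rather than the exact commands I ran. The function name, the /boot/config-<release> location, the extra /etc/udev/rules.d directory, and the line-based rule matching are all assumptions for illustration; a rule split across several lines would not be caught.

```python
import platform
import re
from pathlib import Path

# Matches a udev assignment (single '=') that sets a memory block online.
# A comparison such as ATTR{state}=="online" is deliberately not matched.
ONLINE_ASSIGNMENT = re.compile(r'ATTR\{state\}\s*=\s*"online"')


def find_auto_online_sources(auto_online_file, kernel_config, rules_dirs):
    """Return a list of human-readable hints that memory is auto-onlined."""
    findings = []

    # Runtime default for hot-added memory blocks ("online" or "offline").
    auto_online = Path(auto_online_file)
    if auto_online.is_file():
        value = auto_online.read_text().strip()
        if value != "offline":
            findings.append(f"{auto_online}: auto_online_blocks={value}")

    # Build-time default named in the nvidia-persistenced message.
    config = Path(kernel_config)
    if config.is_file():
        for line in config.read_text().splitlines():
            if line.strip() == "CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y":
                findings.append(f"{config}: {line.strip()}")

    # udev rules that online memory blocks (simple per-line heuristic).
    for rules_dir in rules_dirs:
        directory = Path(rules_dir)
        if not directory.is_dir():
            continue
        for rules_file in sorted(directory.glob("*.rules")):
            text = rules_file.read_text(errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                rule = line.strip()
                if rule.startswith("#"):
                    continue
                if 'SUBSYSTEM=="memory"' in rule and ONLINE_ASSIGNMENT.search(rule):
                    findings.append(f"{rules_file}:{number}: {rule}")
    return findings


if __name__ == "__main__":
    # Paths below are assumptions about a typical CentOS layout.
    results = find_auto_online_sources(
        "/sys/devices/system/memory/auto_online_blocks",
        f"/boot/config-{platform.release()}",
        ["/lib/udev/rules.d", "/etc/udev/rules.d"],
    )
    print("\n".join(results) if results else "no auto-online source found")
```

An empty result only means none of these three places obviously enables auto-onlining; it does not rule out other software onlining the memory.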

Any help would be appreciated.

dkms status

nvidia, 440.33.01, 4.18.0-147.0.3.el7.ppc64le, ppc64le: installed

systemctl status -l nvidia-persistenced

● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-05-04 11:03:20 PDT; 1 weeks 1 days ago
 Main PID: 154360 (nvidia-persiste)
   CGroup: /system.slice/nvidia-persistenced.service
           └─154360 /usr/bin/nvidia-persistenced --verbose

May 04 11:03:19 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3b0000000
May 04 11:03:19 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3c0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3d0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3e0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Device NUMA memory is already online
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: memory146449 state is online and the default zone is not movable (Normal).
                                               This likely means that some non-NVIDIA software has auto-onlined
                                               the device memory before nvidia-persistenced could. Please check
                                               if the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config option
                                               is enabled or if udev has a memory auto-online rule enabled under
                                               /lib/udev/rules.d/.
May 04 11:03:20 * nvidia-persistenced[154360]: device 0035:04:00.0 - failed to online memory.
May 04 11:03:20 * nvidia-persistenced[154360]: device 0035:04:00.0 - persistence mode disabled.
May 04 11:03:20 * nvidia-persistenced[154360]: Local RPC services initialized
May 04 11:03:20 * systemd[1]: Started NVIDIA Persistence Daemon.
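The state the daemon complains about for memory146449 can also be read directly from sysfs. The following is only a small Python sketch of that lookup; the helper name is hypothetical, and it assumes the usual state and valid_zones files under /sys/devices/system/memory/<block>/, with the first entry of valid_zones being the zone the block is in.

```python
from pathlib import Path


def memory_block_status(block, memory_root="/sys/devices/system/memory"):
    """Read the state and zone information of one memory block from sysfs."""
    block_dir = Path(memory_root) / block
    state = (block_dir / "state").read_text().strip()
    zones = (block_dir / "valid_zones").read_text().split()
    return {
        "block": block,
        "state": state,
        "zones": zones,
        # True only when the block is online and its first zone is Movable.
        "online_movable": state == "online" and zones[:1] == ["Movable"],
    }


if __name__ == "__main__":
    try:
        # Block name taken from the nvidia-persistenced log message.
        print(memory_block_status("memory146449"))
    except OSError as error:
        print(f"could not read memory block: {error}")
```

On our machine the log implies this block would show state "online" with zone "Normal", so online_movable would come out False.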

nvidia-smi

Mon May 18 14:39:53 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   36C    P0    54W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   40C    P0    55W / 300W |      0MiB / 16160MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+