Power9 CUDA 10.2, Unable to set persistence mode for GPU

kennric · May 18, 2020, 9:39pm

Hello all,

We are having an issue running CUDA code on a newly installed Power9 machine. We are running on bare metal (not a container or VM) on an AC922 with CentOS 7.8.2003, kernel 4.18.0-147.0.3.el7.ppc64le. I have compiled some of the CUDA sample programs, and they all fail with:

code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&device_count)"

nvidia-smi returns correctly and shows our 4 V100 GPUs, but shows persistence mode “off”. nvidia-persistenced is running via systemd.

As suggested in the systemctl output below, I have checked for memory auto-online settings and found that auto-onlining is not enabled in the kernel configuration or by a udev rule. The memory-online failure reported by nvidia-persistenced seems like the culprit, but I can find no obvious cause for it.
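For reference, the checks described above can be scripted roughly as follows. This is a minimal sketch, not the exact commands that were run: the function names `kernel_auto_online` and `udev_memory_rules` are made up for illustration, and it assumes the kernel config is available as `/boot/config-$(uname -r)` and that udev rules live under `/lib/udev/rules.d` and `/etc/udev/rules.d`.

```shell
#!/bin/sh
# Sketch: look for the auto-online sources named in the nvidia-persistenced
# message. Function names are illustrative, not part of any NVIDIA tool.

# Print "enabled", "disabled", or "unknown" for
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE in the kernel config file given as $1.
kernel_auto_online() {
    cfg="$1"
    if [ ! -r "$cfg" ]; then
        echo "unknown"
    elif grep -q '^CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y' "$cfg"; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Print every uncommented rule line that matches the memory subsystem
# in the *.rules files of the directory given as $1.
udev_memory_rules() {
    grep -H -E '^[^#]*SUBSYSTEM=="memory"' "$1"/*.rules 2>/dev/null || true
}

echo "kernel config: $(kernel_auto_online "/boot/config-$(uname -r)")"

# Runtime policy, where the kernel exposes it (for example "offline" or "online").
cat /sys/devices/system/memory/auto_online_blocks 2>/dev/null \
    || echo "auto_online_blocks: not present"

udev_memory_rules /lib/udev/rules.d
udev_memory_rules /etc/udev/rules.d
```

Any rule line printed by `udev_memory_rules` that sets `ATTR{state}="online"` would be a candidate for the auto-online behaviour the daemon warns about; no output from those two calls suggests no such rule is active in those directories.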

Any help would be appreciated.

dkms status

nvidia, 440.33.01, 4.18.0-147.0.3.el7.ppc64le, ppc64le: installed

systemctl status -l nvidia-persistenced

● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-05-04 11:03:20 PDT; 1 weeks 1 days ago
 Main PID: 154360 (nvidia-persiste)
   CGroup: /system.slice/nvidia-persistenced.service
           └─154360 /usr/bin/nvidia-persistenced --verbose

May 04 11:03:19 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3b0000000
May 04 11:03:19 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3c0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3d0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3e0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Device NUMA memory is already online
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: memory146449 state is online and the default zone is not movable (Normal).
    This likely means that some non-NVIDIA software has auto-onlined
    the device memory before nvidia-persistenced could. Please check
    if the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config option
    is enabled or if udev has a memory auto-online rule enabled under
    /lib/udev/rules.d/.
May 04 11:03:20 * nvidia-persistenced[154360]: device 0035:04:00.0 - failed to online memory.
May 04 11:03:20 * nvidia-persistenced[154360]: device 0035:04:00.0 - persistence mode disabled.
May 04 11:03:20 * nvidia-persistenced[154360]: Local RPC services initialized
May 04 11:03:20 * systemd[1]: Started NVIDIA Persistence Daemon.
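The block named in the log (memory146449) can also be inspected directly through sysfs. The following is a rough sketch under the assumption that the kernel exposes the usual `state` and `valid_zones` files for each block under `/sys/devices/system/memory`; `show_memory_block` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch: print the state and zone information of one memory block.
# $1: sysfs memory directory, $2: block name (e.g. memory146449).
# show_memory_block is an illustrative helper, not an NVIDIA or kernel tool.
show_memory_block() {
    base="$1"
    block="$2"
    dir="$base/$block"
    if [ ! -d "$dir" ]; then
        echo "$block: not present"
        return 1
    fi
    # "state" reports online/offline; "valid_zones" lists the zone(s)
    # the kernel associates with the block.
    state=$(cat "$dir/state" 2>/dev/null)
    zones=$(cat "$dir/valid_zones" 2>/dev/null)
    echo "$block: state=${state:-?} valid_zones=${zones:-?}"
}

show_memory_block /sys/devices/system/memory memory146449 || true
```

A block that reports `online` with a non-movable zone would match what the daemon complains about in the log above.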

nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
