Hello all,
We are having an issue running CUDA code on a newly installed POWER9 machine. We are running on bare metal (not in a container or VM) on an AC922 with CentOS 7.8.2003, kernel 4.18.0-147.0.3.el7.ppc64le. I have compiled some of the CUDA sample programs, and they all fail with:
code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&device_count)"
nvidia-smi runs correctly and shows our 4 V100 GPUs, but reports persistence mode as "Off" on all of them. nvidia-persistenced is running via systemd.
As suggested in the systemctl output below, I have checked the memory auto-online settings: CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not enabled in the kernel, and I found no udev rule that auto-onlines memory. The memory online failure reported by nvidia-persistenced seems like the culprit, but I can find no obvious cause for it.
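For reference, the checks described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands that were run on this machine. The function names and the SYSFS_MEM, KCONFIG and UDEV_DIRS variables are hypothetical, introduced only for illustration. It assumes the kernel config is installed at /boot/config-$(uname -r) and that the kernel exposes /sys/devices/system/memory/auto_online_blocks.

```shell
#!/bin/sh
# Sketch only. SYSFS_MEM, KCONFIG and UDEV_DIRS are hypothetical override
# variables introduced for this example; the defaults are the usual paths.
SYSFS_MEM="${SYSFS_MEM:-/sys/devices/system/memory}"
KCONFIG="${KCONFIG:-/boot/config-$(uname -r)}"
UDEV_DIRS="${UDEV_DIRS:-/lib/udev/rules.d /etc/udev/rules.d}"

# Print the kernel build option, the runtime auto-online policy, and any
# udev rule files that mention the memory subsystem.
check_auto_online() {
    if [ -r "$KCONFIG" ]; then
        grep 'CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE' "$KCONFIG" \
            || echo "kconfig: option not listed"
    else
        echo "kconfig: $KCONFIG not readable"
    fi

    if [ -r "$SYSFS_MEM/auto_online_blocks" ]; then
        echo "auto_online_blocks: $(cat "$SYSFS_MEM/auto_online_blocks")"
    else
        echo "auto_online_blocks: not available"
    fi

    for dir in $UDEV_DIRS; do
        # -l lists matching files; -s hides errors for missing paths.
        grep -ls 'SUBSYSTEM.*"memory"' "$dir"/*.rules
    done
    return 0
}

# Print the state and valid zones of one memory block.
show_block() {
    block="$SYSFS_MEM/memory$1"
    if [ -d "$block" ]; then
        echo "memory$1: state=$(cat "$block/state") zones=$(cat "$block/valid_zones")"
    else
        echo "memory$1: no such block"
    fi
}
```

Running `check_auto_online` and then `show_block 146449` (the block named in the log below) would show whether anything is configured to online memory automatically and which zone that block ended up in. Any rule file the grep lists still has to be read by hand, since a match only means the file mentions the memory subsystem.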
Any help would be appreciated.
dkms status
nvidia, 440.33.01, 4.18.0-147.0.3.el7.ppc64le, ppc64le: installed
systemctl status -l nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-05-04 11:03:20 PDT; 1 weeks 1 days ago
Main PID: 154360 (nvidia-persiste)
CGroup: /system.slice/nvidia-persistenced.service
└─154360 /usr/bin/nvidia-persistenced --verbose
May 04 11:03:19 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3b0000000
May 04 11:03:19 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3c0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3d0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Probing memory address 0x23c3e0000000
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: Device NUMA memory is already online
May 04 11:03:20 * nvidia-persistenced[154360]: NUMA: memory146449 state is online and the default zone is not movable (Normal).
This likely means that some non-NVIDIA software has auto-onlined
the device memory before nvidia-persistenced could. Please check
if the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config option
is enabled or if udev has a memory auto-online rule enabled under
/lib/udev/rules.d/.
May 04 11:03:20 * nvidia-persistenced[154360]: device 0035:04:00.0 - failed to online memory.
May 04 11:03:20 * nvidia-persistenced[154360]: device 0035:04:00.0 - persistence mode disabled.
May 04 11:03:20 * nvidia-persistenced[154360]: Local RPC services initialized
May 04 11:03:20 * systemd[1]: Started NVIDIA Persistence Daemon.
nvidia-smi
Mon May 18 14:39:53 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   36C    P0    54W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   40C    P0    55W / 300W |      0MiB / 16160MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+