After installing CUDA 9.0 on POWER9 (RHEL7), nvidia-smi shows Unknown Error in the Memory-Usage column

I have installed CUDA 9.0 on a RHEL-based POWER9 system, and after installation nvidia-smi shows the following error.
What is this error, and how do I resolve it?

[root@localhost ~]# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.31                 Driver Version: 390.31                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   35C    P0    48W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0    51W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   33C    P0    48W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   39C    P0    52W / 300W |      Unknown Error   |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I suspect you didn’t follow the mandatory additional setup steps which are unique to Power9 CUDA 9/9.1 setup:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup

Thank you so much, txbob!
I did follow these steps when installing CUDA, but I forgot to comment out one rule, and that is why I got this error.
I checked my configuration file again and fixed it. Now nvidia-smi is working fine :)

[root@localhost ~]# nvidia-smi
Tue Apr 17 11:30:24 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.31                 Driver Version: 390.31                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   31C    P0    35W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   33C    P0    36W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   29C    P0    33W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   33C    P0    36W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
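
For anyone who wants to script this check rather than read the table by eye, here is a minimal sketch. It assumes the failure always appears as the literal string "Unknown Error" in the nvidia-smi output, as it does in the outputs posted in this thread. The function name count_unknown_errors is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count the table rows that report "Unknown Error".
# It reads nvidia-smi output on stdin, so it can be tried without a GPU.
count_unknown_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function usable under "set -e".
    grep -c 'Unknown Error' || true
}

# Usage on a real system:
#   nvidia-smi | count_unknown_errors
```

A result of 0 only means that the string no longer appears; it does not prove that the GPU memory is otherwise healthy.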

I encountered the same problem on Ubuntu 16.04. I also followed the POWER9 additional setup steps and changed ‘/lib/udev/rules.d/40-vm-hotadd.rules’, but it did not work. The Memory-Usage column still shows ‘Unknown Error’:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.31                 Driver Version: 390.31                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   40C    P0    52W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   42C    P0    54W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   40C    P0    50W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |      Unknown Error   |      4%      Default |
+-------------------------------+----------------------+----------------------+

There are 2 changes that need to be made. That is one of them. There is another (read the linked section).

Then you need to reboot.

Thanks for the prompt reply. The nvidia-persistenced service has been enabled and is running:

#systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
Active: active (running) since Sun 2018-04-22 00:58:37 CST; 1min 35s ago
Process: 1878 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-m
Main PID: 1890 (nvidia-persiste)
CGroup: /system.slice/nvidia-persistenced.service
└─1890 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --ve

Apr 22 00:58:37 Ubuntu systemd[1]: Starting NVIDIA Persistence Daemon…
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: Verbose syslog connection opened
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: Now running with user ID 110 and group ID 118
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: Started (1890)
Apr 22 00:58:37 Ubuntu systemd[1]: Started NVIDIA Persistence Daemon.
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: device 0004:04:00.0 - registered
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: device 0004:05:00.0 - registered
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: device 0035:03:00.0 - registered
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: device 0035:04:00.0 - registered
Apr 22 00:58:37 Ubuntu nvidia-persistenced[1890]: Local RPC service initialized

What are the contents of your /lib/udev/rules.d/40-vm-hotadd.rules file?

cat /lib/udev/rules.d/40-vm-hotadd.rules

# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}=="Microsoft Corporation", ATTR{[dmi/id]product_name}=="Virtual Machine", GOTO="vm_hotadd_apply"
ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply"
GOTO="vm_hotadd_end"

LABEL="vm_hotadd_apply"

# Memory hotadd request
#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"

LABEL="vm_hotadd_end"

uname -a

Linux Ubuntu 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:46:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

What is the output of:

grep 'SUBSYSTEM=="memory"' /lib/udev/rules.d/*

Also, what is the output of:

dmesg |grep NVRM

grep 'SUBSYSTEM=="memory"' /lib/udev/rules.d/*

/lib/udev/rules.d/40-vm-hotadd.rules:#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

dmesg |grep NVRM

[ 2.853579] NVRM: loading NVIDIA UNIX ppc64le Kernel Module 390.31 Fri Feb 2 00:22:17 PST 2018 (using threaded interrupts)
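
The grep above matches commented and uncommented rules alike, and it only looks in /lib/udev/rules.d. As a rough sketch, the check can be narrowed to rules that are still active and widened to /etc/udev/rules.d, which udev also reads. The function name find_active_memory_rules is made up for illustration, and the pattern assumes the rule is written the way it appears in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print every uncommented memory hot-add rule found in
# the *.rules files of the given directories.
# Returns 0 if at least one active rule is found, 1 otherwise.
find_active_memory_rules() {
    found=1
    for dir in "$@"; do
        for f in "$dir"/*.rules; do
            # Skip the unexpanded glob when a directory has no rules files.
            [ -f "$f" ] || continue
            # A rule counts as active when the line does not start with '#'.
            if grep -H -n -E '^[[:space:]]*SUBSYSTEM=="memory".*ACTION=="add"' "$f"; then
                found=0
            fi
        done
    done
    return "$found"
}

# Usage:
#   find_active_memory_rules /lib/udev/rules.d /etc/udev/rules.d
```

No output and a non-zero exit status mean that no active rule of this form was found in the directories given.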

Have you rebooted?

Is there any difference if you run the nvidia-smi command with sudo?

Anyway, I’m pretty much out of ideas.

Yes, I have rebooted.

@marswhc: what is the output of

cat /usr/lib/systemd/system/nvidia-persistenced.service

Did you export CUDA PATH [step 7.1.1.]?

hi Andrey1984,

cat /usr/lib/systemd/system/nvidia-persistenced.service

cat: /usr/lib/systemd/system/nvidia-persistenced.service: No such file or directory

echo $PATH

/usr/local/cuda-9.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

I have not dealt with this issue myself, but the instructions say:

Create and enable a systemd service file or init script that runs the NVIDIA Persistence Daemon as the first NVIDIA software during or at the end of the boot process. The following service file example is sufficient for most installations:

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

Copy the above text into the following file:
/usr/lib/systemd/system/nvidia-persistenced.service
And run the following command:
$ sudo systemctl enable nvidia-persistenced

However, since you already have the service running from “/lib/systemd/system/nvidia-persistenced.service”, that shouldn’t be the issue.
Could you post the output of the following?

cat /lib/systemd/system/nvidia-persistenced.service

cat /lib/systemd/system/nvidia-persistenced.service

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
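
The two unit files quoted in this thread differ in their ExecStart lines: the example from the instructions runs nvidia-persistenced with only --verbose, while this packaged file adds --user nvidia-persistenced and --no-persistence-mode. The thread does not establish whether that difference matters for the error. As a small sketch for spotting the flag in a unit file (the function name unit_disables_persistence_mode is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: succeed if the unit file starts the daemon with
# --no-persistence-mode, fail otherwise.
unit_disables_persistence_mode() {
    grep -Eq '^ExecStart=.*--no-persistence-mode' "$1"
}

# Usage:
#   if unit_disables_persistence_mode /lib/systemd/system/nvidia-persistenced.service; then
#       echo "daemon is started with --no-persistence-mode"
#   fi
```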

The ‘Unknown Error’ message is gone now that I have changed the Ubuntu kernel to 4.10. Thanks for your help, guys!

I had the same problem on Red Hat. I modified the file /etc/udev/rules.d/40-redhat.rules; it is necessary to comment out this line:

SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

Then reboot and test.
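
The manual edit described here can be sketched as a small shell function. This is only an illustration: the function name comment_out_memory_rule is made up, the pattern assumes the rule is written as quoted in this thread, and the file path in the usage line is the one mentioned in this post.

```shell
#!/bin/sh
# Hypothetical helper: comment out any active memory hot-add rule in the
# given udev rules file, keeping a backup copy with a .bak suffix.
comment_out_memory_rule() {
    # Lines that already start with '#' do not match, so running this twice
    # does not add a second '#'.
    sed -i.bak -E 's/^([[:space:]]*)(SUBSYSTEM=="memory".*ACTION=="add".*)$/\1#\2/' "$1"
}

# Usage (as root), followed by a reboot:
#   comment_out_memory_rule /etc/udev/rules.d/40-redhat.rules
```

Other rules in the file, such as the CPU hot-add rule, are left untouched.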

I am having this issue on Ubuntu 16.04 with kernel 4.13.0-36-generic #40~16.04.1-Ubuntu SMP.

marswhc - did you downgrade from kernel 4.13?