Power9 - nvidia-smi shows "unknown error" in memory column

We have an IBM Power9 system with 4 Tesla V100 GPUs for research use, and are unable to get things working.

We have followed the guide here:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#introduction

The machine is running Ubuntu 16.04 with kernel 4.13 and we have installed nvidia-driver version 396.26 with CUDA Toolkit cuda-9.2.88.

We have followed the Power9 specific steps in the install guide:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup

nvidia-persistenced is running, and we have overridden the udev rule for memory hotplug.

When running nvidia-smi we see an ‘unknown error’ in the memory usage column:

root@openpower:~# nvidia-smi
Fri Jun 8 13:40:44 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   41C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   41C    P0    50W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   36C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   39C    P0    50W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is the same issue cited in this forum thread:
https://devtalk.nvidia.com/default/topic/1032345/after-installing-cuda-9-0-in-power9-rhel7-nvidia-smi-shows-unknown-error-in-memory_usage-column-/?offset=12

but there are no suggestions on that thread that have helped.

We were able to compile the sample code, but running the deviceQuery script fails:

root@openpower:~/NVIDIA_CUDA-9.2_Samples# ./bin/ppc64le/linux/release/deviceQuery
./bin/ppc64le/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL

(note that there are 4 GPUs, but this script fails with “cudaGetDeviceCount returned 3”)

Any ideas?

Thanks,
Ken

nvidia-bug-report.log (1.79 MB)

Maybe the initrd still contains the unchanged udev rule. Run
sudo update-initramfs -u
and reboot.
I don’t know why deviceQuery reports 3 devices; according to the logs, all 4 devices are online.

In the other thread I referenced above, someone mentioned that downgrading to kernel 4.10 fixed the issue, so I have been experimenting with older kernels. I have downgraded from 4.13 to 4.11 and finally to 4.10, re-installing nvidia drivers on each kernel. The install process of course regenerates the initrd, but I’ve seen the same results on every kernel.

My /etc/udev/rules.d/40-vm-hotadd-rules looks like this:

# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}=="Microsoft Corporation", ATTR{[dmi/id]product_name}=="Virtual Machine", GOTO="vm_hotadd_apply"
ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply"
GOTO="vm_hotadd_end"

LABEL="vm_hotadd_apply"

# Memory hotadd request
#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"

LABEL="vm_hotadd_end"

I will attach the bug report log for kernel 4.10 here, in case that helps.

nvidia-bug-report.log (1.79 MB)

Hi kennric,
I’d suggest filing a separate bug at https://developer.nvidia.com/ (login -> My Account -> My Bugs -> Submit a New Bug) so that the issue can be tracked easily and you can be updated quickly in the future.
Thanks for your cooperation.

When I tried to submit the bug, I got the following error:

An AJAX HTTP error occurred.
HTTP Result Code: 403
Debugging information follows.
Path: /system/ajax
StatusText: Forbidden
ResponseText:
403 - Forbidden
403 - Forbidden

You could check if the dev nodes are created with correct access rights:
ls -l /dev/nvidia*
and look for the nvidia-uvm nodes:
crw-rw---- 1 root video 195, 0 14. Jun 08:52 /dev/nvidia0
crw-rw---- 1 root video 195, 255 14. Jun 08:52 /dev/nvidiactl
crw-rw---- 1 root video 195, 254 14. Jun 08:52 /dev/nvidia-modeset
crw-rw-rw- 1 root root 249, 0 14. Jun 21:26 /dev/nvidia-uvm
crw-rw-rw- 1 root root 249, 1 14. Jun 21:26 /dev/nvidia-uvm-tools

root@openpower4:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jun 13 10:48 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jun 13 10:48 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Jun 13 10:48 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Jun 13 10:48 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Jun 13 10:48 /dev/nvidiactl
crw-rw-rw- 1 root root 238,   0 Jun 13 10:48 /dev/nvidia-uvm

I don’t have a /dev/nvidia-uvm-tools, and the video devices are not in group video. I changed those, but running nvidia-smi reset the ownership:

root@openpower4:~# chown root:video /dev/nvidia1
root@openpower4:~# chown root:video /dev/nvidia2
root@openpower4:~# chown root:video /dev/nvidia3
root@openpower4:~# chown root:video /dev/nvidia0
root@openpower4:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root video 195,   0 Jun 13 10:48 /dev/nvidia0
crw-rw-rw- 1 root video 195,   1 Jun 13 10:48 /dev/nvidia1
crw-rw-rw- 1 root video 195,   2 Jun 13 10:48 /dev/nvidia2
crw-rw-rw- 1 root video 195,   3 Jun 13 10:48 /dev/nvidia3
crw-rw-rw- 1 root root  195, 255 Jun 13 10:48 /dev/nvidiactl
crw-rw-rw- 1 root root  238,   0 Jun 13 10:48 /dev/nvidia-uvm
root@openpower4:~# nvidia-smi
Thu Jun 14 12:52:50 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   42C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   44C    P0    50W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   38C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@openpower4:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jun 13 10:48 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jun 13 10:48 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Jun 13 10:48 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Jun 13 10:48 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Jun 13 10:48 /dev/nvidiactl
crw-rw-rw- 1 root root 238,   0 Jun 13 10:48 /dev/nvidia-uvm

The video group is specific to my system; your access rights are the standard ones, so that’s OK.
More of a concern is the missing /dev/nvidia-uvm-tools node. It seems Ubuntu 16.04 ships an old version of nvidia-modprobe; newer versions create that node:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-384/+bug/1760727
A workaround would be to create it manually after each boot:

sudo mknod -m 666 /dev/nvidia-uvm-tools c $(grep nvidia-uvm /proc/devices | awk '{print $1}') 1
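
For anyone who wants to script this workaround, here is a minimal Python sketch of the same lookup the one-liner does. It assumes /proc/devices lists the driver as a line of the form "<major> nvidia-uvm"; the function names uvm_major and mknod_command are made up for illustration and are not part of any NVIDIA tool. It only builds the command string and does not create the node itself.

```python
from typing import Optional


def uvm_major(proc_devices_text: str) -> Optional[int]:
    """Return the character-device major number registered for nvidia-uvm, or None."""
    for line in proc_devices_text.splitlines():
        fields = line.split()
        # Entries look like "238 nvidia-uvm". Compare the name exactly so that
        # other nvidia-* entries are not picked up by accident.
        if len(fields) == 2 and fields[1] == "nvidia-uvm" and fields[0].isdigit():
            return int(fields[0])
    return None


def mknod_command(major: int) -> str:
    """Build the mknod call for /dev/nvidia-uvm-tools (minor 1, mode 666)."""
    return f"mknod -m 666 /dev/nvidia-uvm-tools c {major} 1"
```

Feed uvm_major the contents of /proc/devices and, if it returns a number, run the string from mknod_command as root. A None result would suggest the nvidia_uvm module is not loaded.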

Thanks for your help here, generix.

I have created the uvm-tools device, but it doesn’t appear to fix the issue:

root@openpower4:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jun 13 10:48 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jun 13 10:48 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Jun 13 10:48 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Jun 13 10:48 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Jun 13 10:48 /dev/nvidiactl
crw-rw-rw- 1 root root 238,   0 Jun 13 10:48 /dev/nvidia-uvm
crw-rw-rw- 1 root root 238,   1 Jun 14 13:11 /dev/nvidia-uvm-tools
root@openpower4:~# nvidia-smi
Thu Jun 14 13:12:51 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   42C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   44C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   38C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    51W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then I’m pretty much out of ideas. All that’s left would be grepping through /etc/udev/rules.d to see whether there are duplicates of the rule that also need to be commented out.
It looks rather like a driver bug. Unfortunately, with the latest CUDA you’re running, no other compatible drivers are available right now.
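
The search through the udev rules could be sketched in Python roughly as follows. This is only an illustration: the regular expression assumes the rule looks like the one quoted earlier in the thread (SUBSYSTEM=="memory" ... ATTR{state}="online"), the function names are invented here, and the directory list passed to scan_rule_dirs is up to the caller.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# A rule that targets the memory subsystem and sets the block state to "online".
MEMORY_ONLINE = re.compile(r'SUBSYSTEM=="memory".*ATTR\{state\}="online"')


def active_memory_rules(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, rule) for every uncommented memory-online rule."""
    hits = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out rules are what we want to see
        if MEMORY_ONLINE.search(stripped):
            hits.append((number, stripped))
    return hits


def scan_rule_dirs(dirs: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Scan *.rules files in the given directories for active memory-online rules."""
    found = []
    for directory in dirs:
        base = Path(directory)
        if not base.is_dir():
            continue
        for path in sorted(base.glob("*.rules")):
            text = path.read_text(errors="replace")
            for number, rule in active_memory_rules(text.splitlines()):
                found.append((str(path), number, rule))
    return found
```

A call such as scan_rule_dirs(["/etc/udev/rules.d", "/run/udev/rules.d", "/lib/udev/rules.d"]) should come back empty if the override is effective, with one caveat: a file in /lib/udev/rules.d that is overridden by a same-named file in /etc/udev/rules.d will still be reported. The sketch globs for *.rules because udev only reads files with that extension.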

I can try downgrading CUDA and using an older driver, assuming it supports the V100s, though I think the reason we have 9.2 was because we saw this error on the older driver/cuda. I’ll give it a try and see what happens.

BTW, the log you attached for kernel 4.10 is actually the old 4.13 log. You’ll have to delete the old nvidia-bug-report.log before re-running the script, as it sometimes doesn’t rename the old one.

One little oddity:

*** /proc/driver/nvidia/./gpus/0035:03:00.0/information
*** ls: -r--r--r-- 1 root root 0 2018-06-08 13:52:59.030536765 -0700 /proc/driver/nvidia/./gpus/0035:03:00.0/information
Model: 		 Tesla V100-SXM2-16GB
IRQ:   		 671
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 64 bits
DMA Mask: 	 0xffffffffffffffff
Bus Location: 	 0035:03:00.0
Device Minor: 	 2

The question marks indicate that the driver is not initialized, though nvidia-persistenced should take care of that. Maybe search for nvidia-persistenced.service in /lib/systemd/system or /usr/lib/systemd/system, edit it, and remove the option --no-persistence-mode to use the old persistence mode.
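
To check all GPUs for those question marks at once, something like the Python sketch below might help. It assumes the "Key: value" layout shown in the excerpt above, and the helper names are invented for this example; treating a "?" in the GPU UUID or Video BIOS field as "not initialized" is a heuristic, not documented driver behaviour.

```python
def parse_information(text: str) -> dict:
    """Turn the 'Key: value' lines of an information file into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("***"):
            continue  # header lines added by nvidia-bug-report.sh
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_uninitialized(fields: dict) -> bool:
    """Heuristic: placeholder question marks in the UUID or VBIOS fields."""
    return "?" in fields.get("GPU UUID", "") or "?" in fields.get("Video BIOS", "")
```

Run parse_information over the contents of each /proc/driver/nvidia/gpus/*/information file and pass the result to looks_uninitialized.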

Again, thanks for your help, generix!

After making this change, I noticed something in the systemd log that I either missed before, or wasn’t there before:

Jun 15 09:13:24 openpower4 nvidia-persistenced[3424]: device 0035:03:00.0 - persistence mode enabled.
Jun 15 09:13:24 openpower4 nvidia-persistenced[3424]: NUMA: Failed ioctl call to set device NUMA status: Permission denied
Jun 15 09:13:24 openpower4 nvidia-persistenced[3424]: device 0035:03:00.0 - NUMA: Failed to set device NUMA status to online_in_progress
Jun 15 09:13:24 openpower4 nvidia-persistenced[3424]: device 0035:03:00.0 - failed to online memory.
Jun 15 09:13:25 openpower4 nvidia-persistenced[3424]: device 0035:03:00.0 - persistence mode disabled.

I’m not clear on what exactly is denying permission to NUMA, my devices are:

root@openpower4:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jun 15 09:08 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jun 15 09:08 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Jun 15 09:08 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Jun 15 09:08 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Jun 15 09:08 /dev/nvidiactl
crw-rw-rw- 1 root root 237,   0 Jun 15 09:07 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237,   1 Jun 15 09:12 /dev/nvidia-uvm-tools

and persistenced is running:

nvidia-+ 3424 1 1 09:13 ? 00:00:04 /usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose

BTW, this is back on kernel 4.13.
nvidia-bug-report.log.gz (196 KB)

At least that’s some more info. It looks like the kernel/driver is trying to auto-online the memory, which shouldn’t happen (I’m deducing this from the udev rule that has to be disabled). What’s the output of
sudo cat /sys/devices/system/memory/auto_online_blocks

On second thought, seems more like an intended function of the persistence daemon. Since it states ‘permission denied’, what happens if you run it as root, i.e. again edit the nvidia-persistenced.service and delete the option ‘--user nvidia-persistenced’?

cat /sys/devices/system/memory/auto_online_blocks 
offline

Some more details from journalctl with changes to the persistence daemon command line. I did not reboot between changes; I just edited the service file, reloaded systemd, and restarted the service. If you think rebooting will make a difference to the daemon startup with different options, I will try that too, but it takes so long to reboot the machine that I wanted to run through these variations without rebooting first. In all cases below, nvidia-smi fails with the same output.

With the default persistenced command line in place:

ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: Verbose syslog connection opened
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: Now running with user ID 113 and group ID 119
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: Started (68949)
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0004:04:00.0 - registered
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0004:05:00.0 - registered
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0035:03:00.0 - registered
Jun 18 09:32:46 openpower4 systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up

When the --no-persistence-mode is removed:

-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: Verbose syslog connection opened
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: Now running with user ID 113 and group ID 119
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: Started (69541)
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - registered
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - persistence mode enabled.
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: NUMA: Failed ioctl call to set device NUMA status: Permission de
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - NUMA: Failed to set device NUMA status to 
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - failed to online memory.
Jun 18 09:36:59 openpower4 kernel: ------------[ cut here ]------------
Jun 18 09:36:59 openpower4 kernel: WARNING: CPU: 117 PID: 69541 at /var/lib/dkms/nvidia-396/396.26/build/nvidia/nv.c:18
Jun 18 09:36:59 openpower4 kernel: Modules linked in: nvidia_uvm(POE) ofpart cmdlinepart powernv_flash mtd input_leds m
Jun 18 09:36:59 openpower4 kernel: CPU: 117 PID: 69541 Comm: nvidia-persiste Tainted: P        W  OE   4.13.0-36-generi
Jun 18 09:36:59 openpower4 kernel: task: c000007fb12ce600 task.stack: c0002072c724c000
Jun 18 09:36:59 openpower4 kernel: NIP: c0080000270610a4 LR: c008000027061250 CTR: c00000000016fac0
Jun 18 09:36:59 openpower4 kernel: REGS: c0002072c724f930 TRAP: 0700   Tainted: P        W  OE    (4.13.0-36-generic)
Jun 18 09:36:59 openpower4 kernel: MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
Jun 18 09:36:59 openpower4 kernel:   CR: 24004824  XER: 00000000
Jun 18 09:36:59 openpower4 kernel: CFAR: c00800002706124c SOFTE: 1 
                                   GPR00: c008000027061250 c0002072c724fbb0 c008000027f44438 0000000000000000 
                                   GPR04: c000007fdd09b000 c000007fdd09b000 0000000000000000 0000000000000000 
                                   GPR08: c000007fdd09b1c0 0000000000000001 00000000000000ff c0080000279f8ab8 
                                   GPR12: c00000000016fac0 c00000000fad0700 0000000000000000 0000000000000000 
                                   GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                                   GPR20: 0000000000000000 0000000000000000 000000001000a030 0000000000000001 
                                   GPR24: 0000000000000004 0000000000000000 c0002072f02f8800 c000007fc59812c0 
                                   GPR28: c000207266eade00 0000000000000000 c000007fdd09b000 c000007fdd09b000 
Jun 18 09:36:59 openpower4 kernel: NIP [c0080000270610a4] nv_shutdown_adapter+0x64/0x140 [nvidia]
Jun 18 09:36:59 openpower4 kernel: LR [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:36:59 openpower4 kernel: Call Trace:
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fbb0] [c008000027076d88] nv_uvm_notify_stop_device+0x88/0xb0 [nvidia] (
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fbf0] [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fc70] [c0080000270668c4] nvidia_close+0xb4/0x390 [nvidia]
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fd20] [c008000027060670] nvidia_frontend_close+0x60/0xa0 [nvidia]
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fd50] [c000000000395cc8] __fput+0xe8/0x310
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fdb0] [c00000000012a260] task_work_run+0x140/0x1a0
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fe00] [c00000000001df34] do_notify_resume+0xf4/0x100
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fe30] [c00000000000b7c4] ret_from_except_lite+0x70/0x74
Jun 18 09:36:59 openpower4 kernel: Instruction dump:
Jun 18 09:36:59 openpower4 kernel: e92501d0 7c9e2378 2fa90000 419e00e0 81490050 2f8affff 419e00d4 8129006c 
Jun 18 09:36:59 openpower4 kernel: 2b890001 7d301026 5529f7fe 7d2907b4 <0b090000> 7fc3f378 4801361d 60000000 
Jun 18 09:36:59 openpower4 kernel: ---[ end trace 5cd125178a22e10d ]---
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - persistence mode disabled.
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:05:00.0 - registered

(repeated 4 times, one for each GPU device id)

When only --user nvidia-persistenced is removed:

-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: Verbose syslog connection opened
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: Started (69652)
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: device 0004:04:00.0 - registered
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: device 0004:05:00.0 - registered
Jun 18 09:40:42 openpower4 systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit nvidia-persistenced.service has finished starting up.

And finally, when both options are removed:

-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:42:32 openpower4 nvidia-persistenced[69744]: Verbose syslog connection opened
Jun 18 09:42:32 openpower4 nvidia-persistenced[69744]: Started (69744)
Jun 18 09:42:32 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - registered
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - persistence mode enabled.
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Probing memory address 0x40000000000
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Failed to verify memory node 4096 was probed: No such file
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - NUMA: Probing memory failed: -2
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Failed to find any files in /sys/devices/system/node/node2
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Failed to get all memblock ID's for node255
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Changing node255 state to offline failed
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - failed to online memory.
Jun 18 09:42:33 openpower4 kernel: ------------[ cut here ]------------
Jun 18 09:42:33 openpower4 kernel: WARNING: CPU: 130 PID: 69744 at /var/lib/dkms/nvidia-396/396.26/build/nvidia/nv.c:18
Jun 18 09:42:33 openpower4 kernel: Modules linked in: nvidia_uvm(POE) ofpart cmdlinepart powernv_flash mtd input_leds m
Jun 18 09:42:33 openpower4 kernel: CPU: 130 PID: 69744 Comm: nvidia-persiste Tainted: P        W  OE   4.13.0-36-generi
Jun 18 09:42:33 openpower4 kernel: task: c0002072f0919e00 task.stack: c0002072f09a0000
Jun 18 09:42:33 openpower4 kernel: NIP: c0080000270610a4 LR: c008000027061250 CTR: c00000000016fac0
Jun 18 09:42:33 openpower4 kernel: REGS: c0002072f09a3930 TRAP: 0700   Tainted: P        W  OE    (4.13.0-36-generic)
Jun 18 09:42:33 openpower4 kernel: MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
Jun 18 09:42:33 openpower4 kernel:   CR: 24004824  XER: 00000000
Jun 18 09:42:33 openpower4 kernel: CFAR: c00800002706124c SOFTE: 1 
                                   GPR00: c008000027061250 c0002072f09a3bb0 c008000027f44438 0000000000000000 
                                   GPR04: c000007fdd09b000 c000007fdd09b000 0000000000000000 0000000000000000 
                                   GPR08: c000007fdd09b1c0 0000000000000001 00000000000000ff c0080000279f8ab8 
                                   GPR12: c00000000016fac0 c00000000fad9600 0000000000000000 0000000000000000 
                                   GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                                   GPR20: 0000000000000000 0000000000000000 000000001000a030 0000000000000001 
                                   GPR24: 0000000000000004 0000000000000000 c0002072ecf6e700 c000007fc59812c0 
                                   GPR28: c0002072df85d400 0000000000000000 c000007fdd09b000 c000007fdd09b000 
Jun 18 09:42:33 openpower4 kernel: NIP [c0080000270610a4] nv_shutdown_adapter+0x64/0x140 [nvidia]
Jun 18 09:42:33 openpower4 kernel: LR [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:42:33 openpower4 kernel: Call Trace:
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3bb0] [c008000027076d88] nv_uvm_notify_stop_device+0x88/0xb0 [nvidia] (
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3bf0] [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3c70] [c0080000270668c4] nvidia_close+0xb4/0x390 [nvidia]
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3d20] [c008000027060670] nvidia_frontend_close+0x60/0xa0 [nvidia]
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3d50] [c000000000395cc8] __fput+0xe8/0x310
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3db0] [c00000000012a260] task_work_run+0x140/0x1a0
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3e00] [c00000000001df34] do_notify_resume+0xf4/0x100
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3e30] [c00000000000b7c4] ret_from_except_lite+0x70/0x74
Jun 18 09:42:33 openpower4 kernel: Instruction dump:
Jun 18 09:42:33 openpower4 kernel: e92501d0 7c9e2378 2fa90000 419e00e0 81490050 2f8affff 419e00d4 8129006c 
Jun 18 09:42:33 openpower4 kernel: 2b890001 7d301026 5529f7fe 7d2907b4 <0b090000> 7fc3f378 4801361d 60000000 
Jun 18 09:42:33 openpower4 kernel: ---[ end trace 5cd125178a22e115 ]---

(repeated 4 times)

Thanks!

Oh, and one more thing that is interesting in the above output - with the default daemon commandline, the systemd output shows 3 devices being initialized:

Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0004:04:00.0 - registered
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0004:05:00.0 - registered
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0035:03:00.0 - registered

In all other option variations, the daemon reports initializing all 4 devices, 0004:04:00.0, 0004:05:00.0, 0035:03:00.0, and 0035:04:00.0. Possibly related to other error output reporting only 3 devices.
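
A quick way to compare the registered devices against the expected ones is sketched below in Python. It assumes journal lines in the format pasted above ("nvidia-persistenced[PID]: device <bus id> - registered"); the function names and the idea of passing in an expected list are made up for this illustration.

```python
import re
from typing import Iterable, List

# Matches e.g. "nvidia-persistenced[68949]: device 0004:04:00.0 - registered"
REGISTERED = re.compile(r"nvidia-persistenced\[\d+\]: device (\S+) - registered")


def registered_devices(journal_text: str) -> List[str]:
    """Return bus IDs the daemon reported as registered, in order, without repeats."""
    seen: List[str] = []
    for match in REGISTERED.finditer(journal_text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def missing_devices(journal_text: str, expected: Iterable[str]) -> List[str]:
    """Return expected bus IDs that never show up as registered."""
    registered = set(registered_devices(journal_text))
    return [dev for dev in expected if dev not in registered]
```

With the three lines quoted above and the four bus IDs from the thread as the expected list, missing_devices returns only 0035:04:00.0.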

That looks like a kernel problem. I’m a bit confused: the nvidia driver states it is for Power 9 on Ubuntu 16.04, but Ubuntu states that 18.04 is the first release to fully support Power 9, and this bug report states you’ll need either kernel 4.17 or a patched lower one for Volta on Power 9 on bare metal (possibly already included in the current 18.04 kernel):
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772991

Joining this thread, as we are going through the same pains.

I have just installed our new IBM Power AC922 (8335-GTH)

$ uname -a
Linux dksrv196 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 17:59:00 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
$ nvidia-smi
Thu Jun 21 13:46:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    52W / 300W | Unknown Error        |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   36C    P0    50W / 300W | Unknown Error        |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Installing kernel 4.17 disables the nvidia driver, and building or reinstalling it results in an error:

Loading new nvidia-396-396.26 DKMS files...
Building for 4.17.0-041700-generic
Building for architecture ppc64el
Building initial module for 4.17.0-041700-generic
ERROR (dkms apport): kernel package linux-headers-4.17.0-041700-generic is not supported
Error! Bad return status for module build on kernel: 4.17.0-041700-generic (ppc64el)
Consult /var/lib/dkms/nvidia-396/396.26/build/make.log for more information.

make.log is full of format errors, such as this:

/bin/sh: 1: scripts/basic/fixdep: Exec format error
scripts/Makefile.build:312: recipe for target '/var/lib/dkms/nvidia-396/396.26/build/nvidia/nv-mempool.o' failed
make[2]: *** [/var/lib/dkms/nvidia-396/396.26/build/nvidia/nv-mempool.o] Error 2
make[2]: *** Waiting for unfinished jobs....
/bin/sh: 1: scripts/basic/fixdep: Exec format error
scripts/Makefile.build:312: recipe for target '/var/lib/dkms/nvidia-396/396.26/build/nvidia/nv-cray.o' failed


jensbv134, did you try the test kernel from the bug report which includes all patches?