cat /sys/devices/system/memory/auto_online_blocks
offline
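For anyone comparing against their own system, here is a minimal sketch that prints this setting together with a count of online and offline memory blocks. It assumes the standard sysfs memory hotplug layout (`memoryN/state` files next to `auto_online_blocks`). The function name `memory_block_summary` is made up for illustration and is not part of any NVIDIA tool.

```shell
#!/bin/sh
# Summarize memory hotplug state under a sysfs memory directory.
# Assumed layout: <dir>/auto_online_blocks and <dir>/memoryN/state.
# memory_block_summary is a hypothetical helper name.
memory_block_summary() {
    dir="${1:-/sys/devices/system/memory}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Policy applied to newly hot-added memory blocks.
    if [ -r "$dir/auto_online_blocks" ]; then
        echo "auto_online_blocks: $(cat "$dir/auto_online_blocks")"
    fi
    online=0
    offline=0
    for f in "$dir"/memory*/state; do
        [ -r "$f" ] || continue   # glob did not match, or file unreadable
        # Any other state (for example going-offline) is not counted.
        case "$(cat "$f")" in
            online)  online=$((online + 1)) ;;
            offline) offline=$((offline + 1)) ;;
        esac
    done
    echo "online blocks: $online"
    echo "offline blocks: $offline"
}
```

Calling `memory_block_summary` with no argument reads the real sysfs directory; passing a path points it at a copy of that tree instead.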
Some more details from journalctl with changes to the persistence daemon command line. I did not reboot between changes; I just edited the service file, reloaded systemd, and restarted the service. If you think rebooting will change how the daemon starts up with different options, I will try that too, but the machine takes so long to reboot that I wanted to run through these variations without rebooting first. In all cases below, nvidia-smi fails with the same output.
With the default persistenced command line in place:
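The reload and restart cycle described above could be sketched roughly as follows. This is an illustration, not a record of the exact commands that were run: the unit name is taken from the journal output below, `cycle_persistenced` is a hypothetical helper name, and the `RUN` variable is an assumption added so the commands can be printed instead of executed.

```shell
#!/bin/sh
# Sketch of the per-variation cycle: after editing the unit file, reload
# systemd's unit definitions, restart the daemon, then dump recent journal
# lines (daemon and kernel) without a pager so long lines are not cut off.
# Set RUN=echo to print the commands, or RUN=sudo to run them as root.
UNIT="nvidia-persistenced.service"

cycle_persistenced() {
    ${RUN:-} systemctl daemon-reload &&
    ${RUN:-} systemctl restart "$UNIT" &&
    ${RUN:-} journalctl --no-pager -n 200
}
```

Running with `RUN=echo` first is a cheap way to check what would be executed. The journal is read without `-u` on purpose, since the kernel warnings are not logged under the daemon's unit.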
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: Verbose syslog connection opened
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: Now running with user ID 113 and group ID 119
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: Started (68949)
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0004:04:00.0 - registered
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0004:05:00.0 - registered
Jun 18 09:32:46 openpower4 nvidia-persistenced[68949]: device 0035:03:00.0 - registered
Jun 18 09:32:46 openpower4 systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
When --no-persistence-mode is removed:
-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: Verbose syslog connection opened
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: Now running with user ID 113 and group ID 119
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: Started (69541)
Jun 18 09:36:58 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - registered
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - persistence mode enabled.
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: NUMA: Failed ioctl call to set device NUMA status: Permission de
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - NUMA: Failed to set device NUMA status to
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - failed to online memory.
Jun 18 09:36:59 openpower4 kernel: ------------[ cut here ]------------
Jun 18 09:36:59 openpower4 kernel: WARNING: CPU: 117 PID: 69541 at /var/lib/dkms/nvidia-396/396.26/build/nvidia/nv.c:18
Jun 18 09:36:59 openpower4 kernel: Modules linked in: nvidia_uvm(POE) ofpart cmdlinepart powernv_flash mtd input_leds m
Jun 18 09:36:59 openpower4 kernel: CPU: 117 PID: 69541 Comm: nvidia-persiste Tainted: P W OE 4.13.0-36-generi
Jun 18 09:36:59 openpower4 kernel: task: c000007fb12ce600 task.stack: c0002072c724c000
Jun 18 09:36:59 openpower4 kernel: NIP: c0080000270610a4 LR: c008000027061250 CTR: c00000000016fac0
Jun 18 09:36:59 openpower4 kernel: REGS: c0002072c724f930 TRAP: 0700 Tainted: P W OE (4.13.0-36-generic)
Jun 18 09:36:59 openpower4 kernel: MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
Jun 18 09:36:59 openpower4 kernel: CR: 24004824 XER: 00000000
Jun 18 09:36:59 openpower4 kernel: CFAR: c00800002706124c SOFTE: 1
GPR00: c008000027061250 c0002072c724fbb0 c008000027f44438 0000000000000000
GPR04: c000007fdd09b000 c000007fdd09b000 0000000000000000 0000000000000000
GPR08: c000007fdd09b1c0 0000000000000001 00000000000000ff c0080000279f8ab8
GPR12: c00000000016fac0 c00000000fad0700 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 000000001000a030 0000000000000001
GPR24: 0000000000000004 0000000000000000 c0002072f02f8800 c000007fc59812c0
GPR28: c000207266eade00 0000000000000000 c000007fdd09b000 c000007fdd09b000
Jun 18 09:36:59 openpower4 kernel: NIP [c0080000270610a4] nv_shutdown_adapter+0x64/0x140 [nvidia]
Jun 18 09:36:59 openpower4 kernel: LR [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:36:59 openpower4 kernel: Call Trace:
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fbb0] [c008000027076d88] nv_uvm_notify_stop_device+0x88/0xb0 [nvidia] (
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fbf0] [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fc70] [c0080000270668c4] nvidia_close+0xb4/0x390 [nvidia]
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fd20] [c008000027060670] nvidia_frontend_close+0x60/0xa0 [nvidia]
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fd50] [c000000000395cc8] __fput+0xe8/0x310
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fdb0] [c00000000012a260] task_work_run+0x140/0x1a0
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fe00] [c00000000001df34] do_notify_resume+0xf4/0x100
Jun 18 09:36:59 openpower4 kernel: [c0002072c724fe30] [c00000000000b7c4] ret_from_except_lite+0x70/0x74
Jun 18 09:36:59 openpower4 kernel: Instruction dump:
Jun 18 09:36:59 openpower4 kernel: e92501d0 7c9e2378 2fa90000 419e00e0 81490050 2f8affff 419e00d4 8129006c
Jun 18 09:36:59 openpower4 kernel: 2b890001 7d301026 5529f7fe 7d2907b4 <0b090000> 7fc3f378 4801361d 60000000
Jun 18 09:36:59 openpower4 kernel: ---[ end trace 5cd125178a22e10d ]---
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:04:00.0 - persistence mode disabled.
Jun 18 09:36:59 openpower4 nvidia-persistenced[69541]: device 0004:05:00.0 - registered
(repeated 4 times, one for each GPU device id)
When only --user nvidia-persistenced is removed:
-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: Verbose syslog connection opened
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: Started (69652)
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: device 0004:04:00.0 - registered
Jun 18 09:40:42 openpower4 nvidia-persistenced[69652]: device 0004:05:00.0 - registered
Jun 18 09:40:42 openpower4 systemd[1]: Started NVIDIA Persistence Daemon.
-- Subject: Unit nvidia-persistenced.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nvidia-persistenced.service has finished starting up.
And finally, when both options are removed:
-- Unit nvidia-persistenced.service has begun starting up.
Jun 18 09:42:32 openpower4 nvidia-persistenced[69744]: Verbose syslog connection opened
Jun 18 09:42:32 openpower4 nvidia-persistenced[69744]: Started (69744)
Jun 18 09:42:32 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - registered
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - persistence mode enabled.
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Probing memory address 0x40000000000
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Failed to verify memory node 4096 was probed: No such file
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - NUMA: Probing memory failed: -2
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Failed to find any files in /sys/devices/system/node/node2
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Failed to get all memblock ID's for node255
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: NUMA: Changing node255 state to offline failed
Jun 18 09:42:33 openpower4 nvidia-persistenced[69744]: device 0004:04:00.0 - failed to online memory.
Jun 18 09:42:33 openpower4 kernel: ------------[ cut here ]------------
Jun 18 09:42:33 openpower4 kernel: WARNING: CPU: 130 PID: 69744 at /var/lib/dkms/nvidia-396/396.26/build/nvidia/nv.c:18
Jun 18 09:42:33 openpower4 kernel: Modules linked in: nvidia_uvm(POE) ofpart cmdlinepart powernv_flash mtd input_leds m
Jun 18 09:42:33 openpower4 kernel: CPU: 130 PID: 69744 Comm: nvidia-persiste Tainted: P W OE 4.13.0-36-generi
Jun 18 09:42:33 openpower4 kernel: task: c0002072f0919e00 task.stack: c0002072f09a0000
Jun 18 09:42:33 openpower4 kernel: NIP: c0080000270610a4 LR: c008000027061250 CTR: c00000000016fac0
Jun 18 09:42:33 openpower4 kernel: REGS: c0002072f09a3930 TRAP: 0700 Tainted: P W OE (4.13.0-36-generic)
Jun 18 09:42:33 openpower4 kernel: MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
Jun 18 09:42:33 openpower4 kernel: CR: 24004824 XER: 00000000
Jun 18 09:42:33 openpower4 kernel: CFAR: c00800002706124c SOFTE: 1
GPR00: c008000027061250 c0002072f09a3bb0 c008000027f44438 0000000000000000
GPR04: c000007fdd09b000 c000007fdd09b000 0000000000000000 0000000000000000
GPR08: c000007fdd09b1c0 0000000000000001 00000000000000ff c0080000279f8ab8
GPR12: c00000000016fac0 c00000000fad9600 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 000000001000a030 0000000000000001
GPR24: 0000000000000004 0000000000000000 c0002072ecf6e700 c000007fc59812c0
GPR28: c0002072df85d400 0000000000000000 c000007fdd09b000 c000007fdd09b000
Jun 18 09:42:33 openpower4 kernel: NIP [c0080000270610a4] nv_shutdown_adapter+0x64/0x140 [nvidia]
Jun 18 09:42:33 openpower4 kernel: LR [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:42:33 openpower4 kernel: Call Trace:
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3bb0] [c008000027076d88] nv_uvm_notify_stop_device+0x88/0xb0 [nvidia] (
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3bf0] [c008000027061250] nv_close_device+0xd0/0x250 [nvidia]
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3c70] [c0080000270668c4] nvidia_close+0xb4/0x390 [nvidia]
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3d20] [c008000027060670] nvidia_frontend_close+0x60/0xa0 [nvidia]
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3d50] [c000000000395cc8] __fput+0xe8/0x310
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3db0] [c00000000012a260] task_work_run+0x140/0x1a0
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3e00] [c00000000001df34] do_notify_resume+0xf4/0x100
Jun 18 09:42:33 openpower4 kernel: [c0002072f09a3e30] [c00000000000b7c4] ret_from_except_lite+0x70/0x74
Jun 18 09:42:33 openpower4 kernel: Instruction dump:
Jun 18 09:42:33 openpower4 kernel: e92501d0 7c9e2378 2fa90000 419e00e0 81490050 2f8affff 419e00d4 8129006c
Jun 18 09:42:33 openpower4 kernel: 2b890001 7d301026 5529f7fe 7d2907b4 <0b090000> 7fc3f378 4801361d 60000000
Jun 18 09:42:33 openpower4 kernel: ---[ end trace 5cd125178a22e115 ]---
(repeated 4 times)
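Because the kernel traces bury the daemon's own messages, a small filter like the one below may help pull out just the NUMA and memory-onlining lines when comparing the variations. It is a sketch that assumes the message wording seen in the logs above; `numa_failures` is a hypothetical name.

```shell
#!/bin/sh
# Read journal text on stdin and keep only nvidia-persistenced lines that
# mention NUMA or a failure to online memory. Kernel lines are dropped.
numa_failures() {
    grep -E 'nvidia-persistenced\[[0-9]+\]: .*(NUMA:|failed to online memory)'
}
```

It could be fed from the journal, for example `journalctl --no-pager -b | numa_failures`, or from a saved copy of the log.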
Thanks!