Recent NVIDIA Tesla drivers cause system crashes on POWERNVL systems with P100 GPUs, which prevents upgrading to CUDA Toolkit > 10.1
Affected hardware: IBM Power Systems S822LC for HPC (“Minsky”) with POWER8 CPUs and 4x P100-SXM2-16GB GPUs with NVLink
Affected software: reproduced with NVIDIA drivers 450.80.02, 450.51.06, and 440.118.02, on Linux kernels 4.15.0-124-generic (Ubuntu 18.04), 4.19.152-1 (Debian 10), and 5.8.10-1~bpo10+1 (Debian 10).
This appears to be a regression: a workaround is to downgrade to NVIDIA driver 418.165.02 or 410.129, which unfortunately limits us to CUDA ≤ 10.1.
How to reproduce: install the NVIDIA driver from the .run file, then run “nvidia-smi” in a loop. The crash can also be reproduced with the “lstopo” command from the “hwloc” package. A system crash (kernel panic) occurs within 5 to 30 minutes and involves the “nvidia_uvm” module.
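For anyone who wants to try the reproduction, here is a minimal sketch of the "run in a loop" step in Python. The helper name `run_in_loop`, the one-second delay, and the per-run logging are my own assumptions for illustration; the original reproduction only requires that the command is invoked repeatedly. Output is flushed after each run so the last line is still visible on the console if the machine panics.

```python
import subprocess
import time


def run_in_loop(cmd, max_iterations=None, delay_s=1.0):
    """Run cmd repeatedly and return the number of completed runs.

    Hypothetical helper: with max_iterations=None it loops until the
    process is interrupted (or the system crashes).
    """
    completed = 0
    start = time.monotonic()
    while max_iterations is None or completed < max_iterations:
        # Discard the command's output; only the exit code is logged.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        completed += 1
        elapsed = time.monotonic() - start
        # flush=True so the last line reaches the console before a panic.
        print(
            f"run {completed}: exit code {result.returncode}, "
            f"{elapsed:.0f}s elapsed",
            flush=True,
        )
        time.sleep(delay_s)
    return completed
```

Calling `run_in_loop(["nvidia-smi"])` on an affected system should correspond to the loop described above; passing `["lstopo"]` instead would exercise the hwloc variant.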
Examples of kernel panic logs (on GitHub):
- Ubuntu 18.04: “Kernel Panic on IBM Power8 w/ Nvidia Tesla P100 SXM2 with Nvidia driver 450.80.02 or 450.51.06 or 440.118.02 - Ubuntu 1804”
- Debian 10: “Kernel Panic on IBM Power8 w/ Nvidia Tesla P100 SXM2 with Nvidia driver 450.80.02 or 450.51.06 or 440.118.02”
Firmware versions on the system:
IBM-garrison-OP8_v1.12_2.96
op-build-v2.3-7-g99a6bc8
buildroot-2019.02.1-16-ge01dcd0
skiboot-v6.3.1
hostboot-p8-c893515-pd6f049d
occ-p8-a2856b7
linux-5.0.7-openpower1-p8e31f00
petitboot-v1.10.3
machine-xml-c5c3
(latest IBM firmware package: 8335GTB_820.1923 - OP820.30 - 07/01/2019)