Recent NVIDIA Tesla drivers cause system crashes on POWERNVL systems with P100 GPUs, which blocks upgrading to CUDA Toolkit > 10.1
Affected hardware: IBM Power Systems S822LC for HPC (“Minsky”) with POWER8 CPUs and 4x P100-SXM2-16GB GPUs with NVLink
Affected software: reproduced with NVIDIA drivers 450.80.02, 450.51.06 and 440.118.02, on Linux kernels 4.15.0-124-generic (Ubuntu 18.04), 4.19.152-1 (Debian 10) and 5.8.10-1~bpo10+1 (Debian 10).
This appears to be a regression: a workaround is to downgrade to NVIDIA driver 418.165.02 or 410.129, which unfortunately limits us to CUDA ≤ 10.1.
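To check quickly which side of the regression a given node is on, something like the sketch below may help. The version sets are copied from this report only; the function names are hypothetical, and any driver version not listed here is reported as "unknown" rather than guessed at.

```python
import subprocess

# Versions copied from this report; anything else is deliberately "unknown".
REPORTED_AFFECTED = {"450.80.02", "450.51.06", "440.118.02"}
REPORTED_WORKING = {"418.165.02", "410.129"}


def classify_driver(version):
    """Classify a driver version string against the versions in this report."""
    version = version.strip()
    if version in REPORTED_AFFECTED:
        return "affected"
    if version in REPORTED_WORKING:
        return "workaround"
    return "unknown"


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None on failure."""
    try:
        output = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = output.splitlines()
    # One line is printed per GPU; all GPUs share the same driver.
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("could not query the driver version")
    else:
        print(f"{version}: {classify_driver(version)}")
```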
How to reproduce: install the NVIDIA driver from the .run file and run “nvidia-smi” in a loop. The crash can also be reproduced with the “lstopo” command from the “hwloc” package. A system crash (kernel panic) occurs within 5 to 30 minutes and involves the “nvidia_uvm” module.
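The loop can be as simple as the sketch below. The helper name, the one-second pause between runs and the progress output are my own assumptions for illustration; the only thing that matters is that “nvidia-smi” gets invoked repeatedly.

```python
import subprocess
import sys
import time


def run_in_loop(command, max_iterations=None, delay_seconds=1.0):
    """Run `command` repeatedly and return the number of successful runs.

    Stops when the command cannot be started or exits non-zero, or after
    max_iterations runs (None means keep going until a failure or a crash).
    """
    completed = 0
    while max_iterations is None or completed < max_iterations:
        try:
            result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except FileNotFoundError:
            print(f"command not found: {command[0]}", file=sys.stderr)
            break
        if result.returncode != 0:
            print(f"command failed after {completed} successful runs",
                  file=sys.stderr)
            break
        completed += 1
        # Flush so the last successful run is visible if the machine panics.
        print(f"run {completed} ok", flush=True)
        time.sleep(delay_seconds)
    return completed


if __name__ == "__main__":
    # Assumed 1 s pause between runs; the report does not specify a rate.
    run_in_loop(["nvidia-smi"])
```

Passing `["lstopo"]` instead of `["nvidia-smi"]` exercises the second reproducer mentioned above.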
Examples of the kernel panic:
- Ubuntu 18.04: https://gist.github.com/npf/64cbe9b6feefa9589f1669692bbdd1c2
- Debian 10: https://gist.github.com/npf/ee0dab396e17d3fe6ac5b540b5c4d4b3
Firmware versions on the system:
(latest IBM firmware package: 8335GTB_820.1923 - OP820.30 - 07/01/2019)