Recent NVIDIA Tesla drivers cause system crashes on POWERNVL w/ P100 GPUs

Recent NVIDIA Tesla drivers cause system crashes on POWERNVL systems with P100 GPUs, which prevents upgrading to CUDA Toolkit versions newer than 10.1.

Affected hardware: IBM Power Systems S822LC for HPC (“Minsky”) with POWER8 CPU and 4x P100-SXM2-16GB GPU with NVLink

Affected software: reproduced with NVIDIA drivers 450.80.02, 450.51.06, and 440.118.02, on Linux kernels 4.15.0-124-generic (Ubuntu 18.04), 4.19.152-1 (Debian 10), and 5.8.10-1~bpo10+1 (Debian 10).

This appears to be a regression: a workaround is to downgrade to NVIDIA driver 418.165.02 or 410.129, which unfortunately limits the system to CUDA ≤ 10.1.

How to reproduce: install the NVIDIA driver from the .run file and run “nvidia-smi” in a loop. The crash can also be reproduced with the “lstopo” command from the “hwloc” package. A system crash (kernel panic) occurs within 5 to 30 minutes and involves the “nvidia_uvm” module.
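The loop described above could be scripted roughly as follows. This is only a minimal sketch: the helper name `run_in_loop`, the one-second pause, and the progress line on stderr are assumptions for illustration, not part of the original reproduction steps.

```python
import subprocess
import sys
import time


def run_in_loop(command, interval_s=1.0, max_iterations=None):
    """Run `command` repeatedly and return the number of completed runs.

    max_iterations=None loops until interrupted (or until the machine crashes).
    The 1-second default interval is an assumption, not a value from the report.
    """
    completed = 0
    while max_iterations is None or completed < max_iterations:
        # Output is discarded: only the repeated driver queries matter here.
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        completed += 1
        # Flushed progress line, so the last count is still visible after a crash.
        print(f"{time.strftime('%H:%M:%S')} run {completed}", file=sys.stderr, flush=True)
        time.sleep(interval_s)
    return completed


# On the affected machine (runs until interrupted):
# run_in_loop(["nvidia-smi"])
# run_in_loop(["lstopo"])
```

Running it from a serial console or over SSH with the output logged makes it easier to see how long the system stayed up before the panic.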

Example of kernel panic:

Firmware versions on the system:
IBM-garrison-OP8_v1.12_2.96
op-build-v2.3-7-g99a6bc8
buildroot-2019.02.1-16-ge01dcd0
skiboot-v6.3.1
hostboot-p8-c893515-pd6f049d
occ-p8-a2856b7
linux-5.0.7-openpower1-p8e31f00
petitboot-v1.10.3
machine-xml-c5c3
(latest IBM firmware package: 8335GTB_820.1923 - OP820.30 - 07/01/2019)

The kernel panic can still be reproduced with the latest kernel from Debian 11 and the latest drivers:

  • Driver Version: 460.73.01 ppc64
  • CUDA Version: 11.2.2_460.32.03
  • Kernel Version: 5.10.40-1 (5.10.0-7-powerpc64le)

The last known working driver is 418.197.02 with CUDA 10.1.243.

As a workaround to use newer versions of CUDA, it is possible to keep the old 418 driver for the kernel module and override the user-mode driver with the one from a more recent driver such as 460. This allows using CUDA 11.2 with driver 418. Documentation here:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatible-upgrade
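One way to apply this override per application is to put the newer user-mode libraries first on `LD_LIBRARY_PATH` before launching it. The sketch below assumes those libraries sit in `/usr/local/cuda-11.2/compat`; that path and the helper names `compat_env` and `run_with_compat` are illustrative assumptions, so adjust them to your installation.

```python
import os
import subprocess

# Assumed location of the newer user-mode driver libraries; adjust to where
# they actually live on your system.
ASSUMED_COMPAT_DIR = "/usr/local/cuda-11.2/compat"


def compat_env(compat_dir=ASSUMED_COMPAT_DIR, base_env=None):
    """Return a copy of the environment with compat_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the dynamic loader finds the newer libcuda before the 418 one.
    env["LD_LIBRARY_PATH"] = compat_dir + (os.pathsep + existing if existing else "")
    return env


def run_with_compat(command, compat_dir=ASSUMED_COMPAT_DIR):
    """Run a CUDA application with the compat libraries taking precedence."""
    return subprocess.run(command, env=compat_env(compat_dir), check=False).returncode


# Example (hypothetical application name):
# run_with_compat(["./my_cuda_app"])
```

Setting the variable per process, rather than system-wide, leaves the 418 kernel module and its own user-mode libraries untouched for everything else.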