Recent NVIDIA Tesla drivers cause system crashes on POWERNVL w/ P100 GPUs

pierre.neyron · December 2, 2020, 2:59pm

Recent NVIDIA Tesla drivers cause system crashes on POWERNVL systems with P100 GPUs, which prevents upgrading to CUDA Toolkit versions above 10.1.

Affected hardware: IBM Power Systems S822LC for HPC (“Minsky”) with POWER8 CPU and 4x P100-SXM2-16GB GPU with NVLink

Affected software: reproduced with NVIDIA drivers 450.80.02, 450.51.06, and 440.118.02, on Linux kernels 4.15.0-124-generic (Ubuntu 18.04), 4.19.152-1 (Debian 10), and 5.8.10-1~bpo10+1 (Debian 10).

This appears to be a regression: a workaround is to downgrade to NVIDIA driver 418.165.02 or 410.129, which unfortunately limits the system to CUDA ≤ 10.1.

How to reproduce: install the NVIDIA driver from the .run file and run “nvidia-smi” in a loop. The crash can also be reproduced with the “lstopo” command from the “hwloc” package. A system crash (kernel panic) occurs within 5 to 30 minutes and involves the “nvidia_uvm” module.
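The reproduction loop could be scripted roughly as follows. This is only a sketch: the helper name `poll_command`, the iteration count, and the one-second delay are assumptions made for illustration, not part of the original report. It simply runs a command repeatedly and counts the runs that exit cleanly.

```python
import subprocess
import time


def poll_command(cmd, iterations, delay_s=1.0):
    """Run cmd repeatedly; return how many runs exited with status 0.

    poll_command is a hypothetical helper, not an NVIDIA tool.
    """
    ok = 0
    for i in range(iterations):
        # Output is discarded: only the exit status matters for this loop.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode == 0:
            ok += 1
        # Sleep between runs, but not after the last one.
        if i < iterations - 1:
            time.sleep(delay_s)
    return ok
```

Calling `poll_command(["nvidia-smi"], iterations=1800)` would keep polling for at least the 30-minute upper bound mentioned above. On an affected system the machine is expected to panic before the loop finishes, so the return value matters only on a healthy system.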

Example of kernel panic:

Firmware versions on the system:
IBM-garrison-OP8_v1.12_2.96
op-build-v2.3-7-g99a6bc8
buildroot-2019.02.1-16-ge01dcd0
skiboot-v6.3.1
hostboot-p8-c893515-pd6f049d
occ-p8-a2856b7
linux-5.0.7-openpower1-p8e31f00
petitboot-v1.10.3
machine-xml-c5c3
(latest IBM firmware package: 8335GTB_820.1923 - OP820.30 - 07/01/2019)

baptiste.jonglez · July 8, 2021, 9:22am

The kernel panic can still be reproduced with the latest kernel from Debian 11 and the latest drivers:

Driver Version: 460.73.01 ppc64
CUDA Version: 11.2.2_460.32.03
Kernel Version: 5.10.40-1 (5.10.0-7-powerpc64le)

Last known working driver is 418.197.02 with CUDA 10.1.243.

As a workaround to use newer versions of CUDA, it is possible to keep the old 418 driver for the kernel module and override only the user-mode driver libraries with those from a more recent driver such as 460. This allows using CUDA 11.2 with driver 418. Documentation here:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatible-upgrade
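One way to apply this override per process, rather than system-wide, is to prepend the directory holding the newer user-mode libraries to `LD_LIBRARY_PATH` when launching a CUDA application. The sketch below assumes those libraries have been unpacked into a separate directory. The helper name `env_with_compat` and the paths shown are illustrative assumptions, not taken from this thread.

```python
import os


def env_with_compat(compat_dir, base_env=None):
    """Return a copy of the environment with compat_dir first in LD_LIBRARY_PATH.

    compat_dir is assumed to contain the newer user-mode driver libraries
    (for example libcuda.so); the path itself is site-specific.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the dynamic loader finds the newer libcuda before the 418 one.
    env["LD_LIBRARY_PATH"] = compat_dir + (os.pathsep + current if current else "")
    return env
```

A launcher could then call something like `subprocess.run(["./my_cuda_app"], env=env_with_compat("/usr/local/cuda-11.2/compat"))`, where both the application name and the directory are placeholders. The kernel module stays on 418 and only the process started this way sees the newer user-mode libraries.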
