Power-9 (ppc64le) - Cuda9.2 - Nvidia driver failures

Hi,
We have been running CUDA 9 with the 384.111 NVIDIA driver on our Power9 machines with no issues.
After we tried to upgrade the environment to CUDA 9.2 with a newer driver, the cards stopped working.
We have upgraded the firmware to version OP910.24, as mentioned on the CUDA 9.2 installation page for Power9 users.

Specs:
Power9 (ppc64le)
OS: RHEL7.5
FW: OP910.24
GPUs: V100

nvidia-smi outputs:

1: With driver 396.26, the Memory-Usage column shows "Unknown Error" for every GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   33C    P0    50W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   35C    P0    53W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   30C    P0    52W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   34C    P0    52W / 300W |      Unknown Error   |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2: Also with driver 396.26, GPU 3 reports 0MiB / 3072MiB, and GPU 0 and GPU 2 show memory in use while idle. The GPUs that show memory in use while idle are not found when we try to use them.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000004:04:00.0 Off |                    0 |
| N/A   32C    P0    50W / 300W |     30MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000004:05:00.0 Off |                    0 |
| N/A   33C    P0    39W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000035:03:00.0 Off |                    0 |
| N/A   29C    P0    52W / 300W |     15MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000035:04:00.0 Off |                    0 |
| N/A   32C    P0    51W / 300W |       0MiB / 3072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3: With driver 396.44 we see the same issue: GPUs that show memory in use while idle are not found when we try to use them. After a reboot, the "Unknown Error" message can also appear.

Mon Dec 17 11:02:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   32C    P0    36W / 200W |     32MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   35C    P0    40W / 200W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   29C    P0    38W / 200W |     14MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   35C    P0    39W / 200W |     14MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
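
In case it helps anyone compare results, the symptoms above can be flagged automatically. The following is only a rough Python sketch, not something from the driver documentation: it parses the CSV form of the nvidia-smi query and lists GPUs whose memory fields are unreadable, whose total differs from the expected size, or which show memory in use. The function names and the expected total of 15360 MiB (the value GPUs 0 to 2 report above) are assumptions made for illustration, and "no memory in use while idle" is assumed to be the healthy state.

```python
import subprocess

# Assumption: every V100 in this machine should report 15360 MiB in total.
EXPECTED_TOTAL_MIB = 15360


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used, memory.total' rows into (index, used, total).

    Fields that are not plain integers (for example an error string)
    are returned as None.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # skip anything that is not a three-column GPU row
        index, used, total = fields
        rows.append((
            int(index),
            int(used) if used.isdigit() else None,
            int(total) if total.isdigit() else None,
        ))
    return rows


def suspicious_gpus(rows, expected_total=EXPECTED_TOTAL_MIB):
    """Return indices of GPUs with unreadable, mis-sized, or in-use memory."""
    bad = []
    for index, used, total in rows:
        if used is None or total is None or total != expected_total or used > 0:
            bad.append(index)
    return bad


def query_nvidia_smi():
    """Run nvidia-smi locally and return its CSV output (values in MiB)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On an idle machine, `suspicious_gpus(parse_gpu_memory(query_nvidia_smi()))` would be expected to return an empty list if everything is healthy.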

Please let us know of any known solutions or upcoming fixes.
Thanks!

Did you modify the udev rule?
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#power9-setup

Furthermore, AFAIK you need to be running kernel-alt.

I’ll try this configuration, thanks for pointing it out!

Can you please provide more information about ‘kernel-alt’ for ppc?

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.5_release_notes/index#new_features_kernel

In short, it takes three things:

  • comment out the udev rule mentioned in the installation guide in the rules file
  • run the right kernel (4.14, i.e. kernel-alt), including its devel package and headers
  • make sure nvidia-persistenced is running as root with persistence mode enabled (i.e. started without options)
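
The three checks above can be sketched as a small Python script. This is a rough sketch under assumptions, not an official procedure: the rules file path (`/etc/udev/rules.d/40-redhat.rules`), the two fragments used to recognise the memory rule, and all function names are illustrative guesses, so compare them with the linked installation guide before relying on the result.

```python
import platform
import subprocess

# Assumption: the rule in question is the memory auto-online rule, recognised
# here by these two fragments. Check the installation guide for the exact text.
RULE_MARKERS = ('SUBSYSTEM=="memory"', 'ATTR{state}="online"')
# Assumption: location of the local copy of the rules file.
RULES_PATH = "/etc/udev/rules.d/40-redhat.rules"


def memory_rule_disabled(rules_text):
    """True if no uncommented line contains all of RULE_MARKERS."""
    for line in rules_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out lines are inactive
        if all(marker in stripped for marker in RULE_MARKERS):
            return False
    return True


def kernel_is_4_14(release):
    """True if a kernel release string such as uname -r output is 4.14.x."""
    return release.startswith("4.14.")


def persistenced_ok(ps_text):
    """True if nvidia-persistenced runs as root with no command-line options.

    Expects lines of 'user command [options...]', as printed by
    `ps -eo user=,args=`.
    """
    for line in ps_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        user, command, options = parts[0], parts[1], parts[2:]
        if command.rsplit("/", 1)[-1] == "nvidia-persistenced":
            return user == "root" and not options
    return False


def check_host():
    """Run all three checks on the local machine and return the results."""
    with open(RULES_PATH) as handle:
        rules_text = handle.read()
    ps_text = subprocess.run(
        ["ps", "-eo", "user=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        "udev rule commented out": memory_rule_disabled(rules_text),
        "kernel is 4.14": kernel_is_4_14(platform.release()),
        "nvidia-persistenced as root, no options": persistenced_ok(ps_text),
    }
```

Calling `check_host()` on the affected machine returns a dictionary with one True/False entry per item in the list. It does not check that the matching kernel devel package and headers are installed.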