Tesla P100 SXM2 GPUs on POWER8 are not found by `nvidia-smi -q`

Hi,

I have an IBM POWER8 server with Tesla P100 GPUs installed.

The OS is RHEL 7.4 (ppc64le, GA release):

cat /etc/os-release

NAME="Red Hat Enterprise Linux Server"
VERSION="7.4 (Maipo)"

uname -r

3.10.0-693.el7.ppc64le

However, I cannot get either CUDA 9.0 or CUDA 9.2 working on it.

rpm -aq |grep dkms

dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch ==> I can only install cuda-9.0 successfully with dkms-2.2.0. Other versions such as 2.5.0 or 2.6.6 fail with an error like /var/lib/dkms/nvidia/384.81/build/common/inc/nv-mm.h: error get_user_pages

rpm -aq |grep cuda-driver

cuda-driver-dev-9-0-9.0.176-1.ppc64le
cuda-drivers-384.81-1.ppc64le ==> The driver is 384.81

But there are NVRM errors in dmesg:

dmesg | grep NVRM

[ 2.801928] NVRM: loading NVIDIA UNIX ppc64le Kernel Module 384.81 Sat Sep 2 00:45:52 PDT 2017 (using threaded interrupts)
[ 65.466508] NVRM: RmInitAdapter failed! (0x25:0x5b:1080)
[ 65.466685] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 65.824117] NVRM: RmInitAdapter failed! (0x25:0x5b:1080)
[ 65.824236] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 66.217278] NVRM: RmInitAdapter failed! (0x25:0x5b:1080)
[ 66.217354] NVRM: rm_init_adapter failed for device bearing minor number 2
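To see at a glance which devices failed to initialize, the kernel log can be filtered down to the minor numbers. This is only a minimal sketch: `failed_minors` is a hypothetical helper name, and it assumes the message wording shown in the log above.

```shell
# Hypothetical helper: read kernel log text on stdin and print the
# device minor numbers for which "rm_init_adapter failed" was logged.
# Assumes the NVRM message wording seen in the dmesg output above.
failed_minors() {
    grep 'rm_init_adapter failed' |
        sed -n 's/.*minor number \([0-9][0-9]*\).*/\1/p' |
        sort -un    # one line per minor number, duplicates removed
}

# Example use (not run here): dmesg | failed_minors
```

For the log above this would list minors 0, 1 and 2, i.e. every GPU in the system fails in the same way.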

No GPUs are found:

nvidia-smi -q

No devices were found

lspci |grep NVIDIA

0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
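Since the cards are visible on the PCI bus but not to the driver, it may help to check which kernel driver, if any, is bound to each of them. The following is a rough sketch: `nvidia_pci_addrs` is a hypothetical helper name, and it assumes the `lspci` line format shown above (PCI address in the first column).

```shell
# Hypothetical helper: read `lspci` output on stdin and print the PCI
# address (first column) of every line that mentions NVIDIA.
nvidia_pci_addrs() {
    awk '/NVIDIA/ { print $1 }'
}

# Example use (not run here): show the kernel driver bound to each device.
#   lspci | nvidia_pci_addrs | while read -r addr; do
#       lspci -k -s "$addr"
#   done
```

`lspci -k` prints the kernel driver in use for a device and `-s` selects a single device by address, so the loop would report the binding for each of the three cards listed above.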

Are there any tips on what I can do next? Thanks a lot!

Any tips? This problem has blocked me for weeks. Can anybody help? Thanks a lot!

I would start with a clean load of the OS, get your CUDA and driver installers here:

http://www.nvidia.com/getcuda

and follow the instructions here:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

carefully. I suggest reading the whole document first. Note that there are POWER9-specific steps.
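As a rough illustration of the kind of pre-installation check the guide describes, the kernel development packages should match the running kernel before the driver module is built. This is a sketch only: `kernel_header_pkgs` is a hypothetical helper, and the package naming assumes a RHEL-style system like the one in this thread.

```shell
# Hypothetical helper: given a kernel release string (as printed by
# `uname -r`), print the matching kernel-devel and kernel-headers
# package names. Assumes RHEL-style package naming.
kernel_header_pkgs() {
    printf 'kernel-devel-%s kernel-headers-%s\n' "$1" "$1"
}

# Example use (not run here):
#   uname -m && cat /etc/*release                          # architecture and distribution
#   gcc --version                                          # compiler is installed
#   sudo yum install $(kernel_header_pkgs "$(uname -r)")   # headers for the running kernel
```

For the kernel in this thread, the helper would name `kernel-devel-3.10.0-693.el7.ppc64le` and `kernel-headers-3.10.0-693.el7.ppc64le`.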