Failed call to cuInit CUDA_ERROR_NOT_INITIALIZED (Device mapping: no known devices)

I am not able to run a TensorFlow program on the GPU. TensorFlow 1.2.1 throws this exception: Failed call to cuInit CUDA_ERROR_NOT_INITIALIZED (Device mapping: no known devices).

Could you check the $PATH? Let me know if anything is missing.

(tensorflow) [srcAI9@ca47dppaia001 ~]$ nvidia-smi
Fri Nov 9 15:35:53 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2… Off | 00000004:04:00.0 Off | 0 |
| N/A 29C P0 47W / 300W | Unknown Error | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2… Off | 00000035:03:00.0 Off | 0 |
| N/A 29C P0 49W / 300W | Unknown Error | 4% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
(tensorflow) [srcAI9@ca47dppaia001 ~]$ echo $PATH
/usr/local/cuda/bin:/home/srcAI9/anaconda3/envs/tensorflow/bin:/home/srcAI9/anaconda3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/srcAI9/.local/bin:/home/srcAI9/bin:/home/srcAI9/anaconda2/bin
(tensorflow) [srcAI9@ca47dppaia001 ~]$ echo $LD_LIBRARY_PATH
:/usr/local/cuda/lib64
(tensorflow) [srcAI9@ca47dppaia001 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:10:00_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
(tensorflow) [srcAI9@ca47dppaia001 ~]$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130
(tensorflow) [srcAI9@ca47dppaia001 ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX ppc64le Kernel Module 410.72 Wed Oct 17 20:19:50 CDT 2018
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)

Thanks,
S.Venkatesh

Me too!

power9-08.discovery(1012)# uname -a
Linux power9-08 4.14.0-49.8.1.el7a.ppc64le #1 SMP Mon May 28 07:06:43 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

power9-08.discovery(1013)# systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2018-11-20 10:11:18 CST; 4min 13s ago
Process: 132714 ExecStart=/usr/sbin/nvidia-persistenced --user root (code=exited, status=0/SUCCESS)
Main PID: 132715 (nvidia-persiste)
CGroup: /system.slice/nvidia-persistenced.service
└─132715 /usr/sbin/nvidia-persistenced --user root

Nov 20 10:10:54 power9-08 systemd[1]: Starting NVIDIA Persistence Daemon…
Nov 20 10:11:18 power9-08 systemd[1]: Started NVIDIA Persistence Daemon.
power9-08.discovery(1014)# nvidia-smi
Tue Nov 20 10:15:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2… On | 00000004:04:00.0 Off | 0 |
| N/A 34C P0 39W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2… On | 00000004:05:00.0 Off | 0 |
| N/A 39C P0 39W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2… Off | 00000035:03:00.0 Off | 0 |
| N/A 35C P0 54W / 300W | Unknown Error | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2… On | 00000035:04:00.0 Off | 0 |
| N/A 37C P0 39W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

AC922 (8335-GTH) Firmware
power9-08.discovery(1015)# lshw | grep with
product: 8335-GTH (ibm,witherspoon)
version: witherspoon-ibm-OP9-v2.0.8-2.2-prod
power9-08.discovery(1020)#

Did you validate the CUDA install?

Is there a history of previous CUDA software/drivers installed on the machine?

First question response:
What do you mean by validating the CUDA install? If you mean running the CUDA examples from
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html then no, because nvidia-smi was reporting Memory-Usage on only three of the four V100-SXM2 GPUs. If the NVIDIA utility itself was failing, there seemed little point in continuing with the CUDA examples. I did try multiple reboots and power cycles, and I checked the POWER9 firmware.

Second question response:
I was running NVIDIA driver 396.44 and CUDA 9.2 without this problem, so I removed NVIDIA driver 410.72 and CUDA 10.0 and reinstalled NVIDIA driver 396.44 and CUDA 9.2.

power9-08.discovery(1100)# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX ppc64le Kernel Module 396.44 Wed Jul 11 17:17:20 PDT 2018
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)
power9-08.discovery(1101)# nvidia-smi
Tue Nov 20 11:50:16 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2… On | 00000004:04:00.0 Off | 0 |
| N/A 33C P0 39W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2… On | 00000004:05:00.0 Off | 0 |
| N/A 37C P0 38W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2… On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2… On | 00000035:04:00.0 Off | 0 |
| N/A 35C P0 39W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
power9-08.discovery(1102)#

BTW, the instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html say:

Use the following command to uninstall a Toolkit runfile installation:
sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl
Use the following command to uninstall a Driver runfile installation:
sudo /usr/bin/nvidia-uninstall

but uninstall_cuda_X.Y.pl and nvidia-uninstall do not exist:
power9-08.discovery(1097)# find /usr -name "*uninstall*" | egrep "cuda|nvidia"
/usr/local/cuda-9.2/doc/html/nsightee-plugins-install-guide/graphics/uninstall_plugin.png

so I used the following command:

power9-08.discovery(1098)# rpm -e `rpm -qa | egrep "cuda|nvidia"`

Later,
David

Yes I meant to say “verify” your install, as indicated in the linux install guide:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation
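
The posts above already use cat /proc/driver/nvidia/version to see which kernel module is loaded. As a minimal sketch of turning that into a quick check after a driver swap, the version can be pulled out of the NVRM line and compared with the version you expect. The function names nvrm_driver_version and driver_matches are hypothetical helpers made up for this illustration, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: read the contents of /proc/driver/nvidia/version on
# stdin and print the driver version found on the "NVRM version:" line.
nvrm_driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeed only if the version read from stdin equals
# the expected version passed as the first argument.
driver_matches() {
    expected="$1"
    actual="$(nvrm_driver_version)"
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

On a live system this would be used as nvrm_driver_version < /proc/driver/nvidia/version, or driver_matches 410.72 < /proc/driver/nvidia/version. A mismatch with the version you believe you installed would be one hint that an older driver is still loaded.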

The problem with the 410.xx driver may be due to something from the 396.xx driver install that was still left over/not cleaned up.

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation

The uninstall…pl scripts that you can’t find are not there because you appear to be using a package manager (rpm) install method, not a runfile install method.

The proper method for removing a package manager install is indicated in the linux install guide at the above link.

Thanks, I saw the instructions at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation for uninstalling, but could not determine the proper package names for Cuda and the Nvidia driver.

Use the following commands to uninstall a RPM/Deb installation:
sudo yum remove <package_name> # Redhat/CentOS
sudo dnf remove <package_name> # Fedora
sudo zypper remove <package_name> # OpenSUSE/SLES
sudo apt-get --purge remove <package_name> # Ubuntu

So I removed every package with cuda or nvidia in the rpm name (about 293 rpms).
power9-07.discovery(961)# yum list | egrep "cuda|nvidia" | wc -l
293

Do you know what the correct <package_name> is for Redhat 7.5 or 7.6?

BTW, there is no Nvidia runfile available for Linux POWER LE RHEL 7 for Cuda 9.2 or 10.0, just the rpm.

The package name should be the same as whatever package you installed. Usually it is just cuda.

The linux install guide covers all situations, not just Power. Therefore the instructions cover the case where the user is using the runfile install method and also the case where the user is using the package manager install method. In order to follow the proper path through the linux install guide, it’s helpful to keep in mind what kind of install you are using. It’s not suggesting that every single scenario is provided for with both a package manager install method and a runfile install method.
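
If you are unsure what was installed, one way to find out is to ask rpm directly. Note that yum list with no arguments also shows packages that are merely available in the configured repositories, while rpm -qa shows only what is installed. A minimal sketch follows; filter_gpu_packages is a hypothetical helper name, and the cuda|nvidia pattern is simply the one used earlier in this thread, not an official package list.

```shell
#!/bin/sh
# Hypothetical helper: read one package name per line on stdin and print,
# sorted and de-duplicated, only the names containing "cuda" or "nvidia".
filter_gpu_packages() {
    grep -E 'cuda|nvidia' | sort -u
}
```

For example, rpm -qa --qf '%{NAME}\n' | filter_gpu_packages prints the installed candidates so they can be reviewed before anything is passed to yum remove.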

Fixed the problem with the NVIDIA Persistence Daemon not updating all four GPUs during boot by putting "/usr/bin/nvidia-smi -pm 1" in /etc/rc.local.

power9-08.discovery(1027)# nvidia-smi
Tue Nov 27 17:39:53 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2… On | 00000004:04:00.0 Off | 0 |
| N/A 32C P0 39W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2… On | 00000004:05:00.0 Off | 0 |
| N/A 36C P0 38W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2… On | 00000035:03:00.0 Off | 0 |
| N/A 32C P0 40W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2… On | 00000035:04:00.0 Off | 0 |
| N/A 35C P0 38W / 300W | 0MiB / 16128MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
power9-08.discovery(1028)#
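
To confirm after a reboot that the rc.local workaround took effect on every GPU without reading the whole table, the persistence state can be queried in CSV form. This is only a sketch: gpus_without_persistence is a hypothetical helper name, and it assumes each input line has the form "index, state" with the state reported as Enabled when persistence mode is on. I have not checked what the query prints for a GPU that shows Unknown Error, so any state other than Enabled is flagged.

```shell
#!/bin/sh
# Hypothetical helper: read "index, persistence_mode" lines on stdin and
# print the index of every GPU whose state is anything other than "Enabled".
gpus_without_persistence() {
    awk -F',' '{
        gsub(/[ \t\r]/, "", $1)   # strip whitespace around the GPU index
        gsub(/[ \t\r]/, "", $2)   # strip whitespace around the state
        if ($2 != "Enabled") print $1
    }'
}
```

Usage would be nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence. Empty output means persistence mode is reported as enabled on all GPUs.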