Failed call to cuInit CUDA_ERROR_NOT_INITIALIZED (Device mapping: no known devices)

svenkatesh.87 · November 9, 2018, 11:42pm

I am not able to run a TensorFlow program on the GPU. It throws the exception "Failed call to cuInit CUDA_ERROR_NOT_INITIALIZED (Device mapping: no known devices)". I am using tensorflow 1.2.1.

Could you check the $PATH? Let me know in case anything is missing.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(tensorflow) [srcAI9@ca47dppaia001 ~]$ echo $PATH
/usr/local/cuda/bin:/home/srcAI9/anaconda3/envs/tensorflow/bin:/home/srcAI9/anaconda3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/srcAI9/.local/bin:/home/srcAI9/bin:/home/srcAI9/anaconda2/bin
(tensorflow) [srcAI9@ca47dppaia001 ~]$ echo $LD_LIBRARY_PATH
:/usr/local/cuda/lib64
(tensorflow) [srcAI9@ca47dppaia001 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:10:00_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
(tensorflow) [srcAI9@ca47dppaia001 ~]$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130
(tensorflow) [srcAI9@ca47dppaia001 ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX ppc64le Kernel Module 410.72 Wed Oct 17 20:19:50 CDT 2018
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)

Thanks,
S.Venkatesh

dcarver · November 20, 2018, 4:24pm

Me too!

power9-08.discovery(1012)# uname -a
Linux power9-08 4.14.0-49.8.1.el7a.ppc64le #1 SMP Mon May 28 07:06:43 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

power9-08.discovery(1013)# systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-11-20 10:11:18 CST; 4min 13s ago
  Process: 132714 ExecStart=/usr/sbin/nvidia-persistenced --user root (code=exited, status=0/SUCCESS)
 Main PID: 132715 (nvidia-persiste)
   CGroup: /system.slice/nvidia-persistenced.service
           └─132715 /usr/sbin/nvidia-persistenced --user root

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

AC922 (8335-GTH) Firmware
power9-08.discovery(1015)# lshw | grep with
product: 8335-GTH (ibm,witherspoon)
version: witherspoon-ibm-OP9-v2.0.8-2.2-prod
power9-08.discovery(1020)#

Robert_Crovella · November 20, 2018, 4:40pm

Did you validate the CUDA install?

Is there a history of previous CUDA software/drivers installed on the machine?

dcarver · November 20, 2018, 5:58pm

First question response:
What do you mean by validating the CUDA install? If you mean validating by running the CUDA examples from the Installation Guide Linux :: CUDA Toolkit Documentation, then no, because nvidia-smi was only reporting Memory-Usage on three of the four V100-SXM2 GPUs. If the NVIDIA utility was failing, why continue with the CUDA examples? I did try multiple reboots and power cycles, and checked the POWER9 firmware.

Second question response:
I was running NVIDIA driver 396.44 and CUDA 9.2 without this problem, so I removed NVIDIA driver 410.72 and CUDA 10.0 and reinstalled NVIDIA driver 396.44 and CUDA 9.2.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
power9-08.discovery(1102)#

BTW, the instructions at Installation Guide Linux :: CUDA Toolkit Documentation say:

“Use the following command to uninstall a Toolkit runfile installation:
$ sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl
Use the following command to uninstall a Driver runfile installation:
$ sudo /usr/bin/nvidia-uninstall”

but uninstall_cuda_X.Y.pl and nvidia-uninstall do not exist
power9-08.discovery(1097)# find /usr -name "*uninstall*" | egrep "cuda|nvidia"
/usr/local/cuda-9.2/doc/html/nsightee-plugins-install-guide/graphics/uninstall_plugin.png

so I used the following command

power9-08.discovery(1098)# rpm -e `rpm -qa | egrep "cuda|nvidia"`

Later,
David

Robert_Crovella · November 20, 2018, 6:39pm

Yes I meant to say “verify” your install, as indicated in the linux install guide:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation

The problem with the 410.xx driver may be due to something from the 396.xx driver install that was still left over/not cleaned up.

Installation Guide Linux :: CUDA Toolkit Documentation

The uninstall…pl scripts that you can’t find are not there because you appear to be using a package manager (rpm) install method, not a runfile install method.

The proper method for removing a package manager install is indicated in the linux install guide at the above link.
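
A quick pre-check that could precede the install guide's verification steps is to confirm which kernel module version is actually loaded and whether the driver enumerates the GPUs. This is a minimal sketch, not a substitute for the guide's own steps: `nvrm_version` is a hypothetical helper name introduced here, and it assumes the NVRM line has the same layout as the /proc/driver/nvidia/version output shown earlier in this thread.

```shell
# Hypothetical helper: print the loaded NVIDIA kernel module version,
# i.e. the field that follows the word "Module" on the NVRM line.
nvrm_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i < NF; i++)
            if ($i == "Module") { print $(i + 1); exit }
    }'
}

# Report the loaded kernel module version, if the driver is loaded at all.
if [ -r /proc/driver/nvidia/version ]; then
    echo "kernel module: $(nvrm_version < /proc/driver/nvidia/version)"
else
    echo "NVIDIA kernel module does not appear to be loaded"
fi

# List the GPUs the driver can see (one line per GPU), if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not talk to the driver"
fi
```

If the version printed here differs from the driver package you believe is installed, leftovers from an earlier driver install are a plausible suspect.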

dcarver · November 20, 2018, 6:58pm

Thanks, I saw the uninstall instructions at Installation Guide Linux :: CUDA Toolkit Documentation, but could not determine the proper package names for CUDA and the NVIDIA driver.

“Use the following commands to uninstall a RPM/Deb installation:
$ sudo yum remove <package_name> # Redhat/CentOS
$ sudo dnf remove <package_name> # Fedora
$ sudo zypper remove <package_name> # OpenSUSE/SLES
$ sudo apt-get --purge remove <package_name> # Ubuntu”

So I removed every package (about 293 RPMs) with cuda or nvidia in its name.
power9-07.discovery(961)# yum list | egrep "cuda|nvidia" | wc -l
293

Do you know what the correct <package_name> is for Red Hat 7.5 or 7.6?

BTW, there is no NVIDIA runfile available for Linux POWER LE RHEL 7 for CUDA 9.2 or 10.0, just the RPM.

Robert_Crovella · November 20, 2018, 7:02pm

The package name should be the same as whatever package you installed. Usually it is just cuda.

The linux install guide covers all situations, not just Power. Therefore the instructions cover the case where the user is using the runfile install method and also the case where the user is using the package manager install method. In order to follow the proper path through the linux install guide, it’s helpful to keep in mind what kind of install you are using. It’s not suggesting that every single scenario is provided for with both a package manager install method and a runfile install method.
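
To see which package names are actually installed, as opposed to merely available in a repository, one option is to query the RPM database directly. This is a minimal sketch assuming an rpm-based system; `filter_gpu_packages` is a hypothetical helper name. Note that `yum list` with no arguments also includes available packages, so a count taken from it can exceed what is installed.

```shell
# Hypothetical helper: keep only package names that mention cuda or nvidia.
filter_gpu_packages() {
    grep -E -i 'cuda|nvidia' | sort
}

# `rpm -qa` prints every installed package, one per line.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | filter_gpu_packages
fi
```

The names printed this way are the ones that can be passed to `yum remove` on Red Hat/CentOS, as in the guide's quoted instructions.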

dcarver · November 27, 2018, 11:42pm

Fixed the problem with the NVIDIA Persistence Daemon not updating all four GPUs during boot by putting "/usr/bin/nvidia-smi -pm 1" in /etc/rc.local.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
power9-08.discovery(1028)#
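
The fix above could be written as a guarded fragment for /etc/rc.local, followed by a check that every GPU reports persistence mode as enabled. This is a minimal sketch: `count_not_persistent` is a hypothetical helper, and the check assumes the installed nvidia-smi supports the `persistence_mode` query field with `--format=csv,noheader`. On RHEL 7, /etc/rc.d/rc.local also has to be executable for it to run at boot.

```shell
# Fragment for /etc/rc.local: enable persistence mode on all GPUs at boot.
if [ -x /usr/bin/nvidia-smi ]; then
    /usr/bin/nvidia-smi -pm 1
fi

# Hypothetical helper: count GPUs whose persistence mode is not "Enabled".
# Expects lines such as "0, Enabled" on stdin and prints a single number.
count_not_persistent() {
    awk -F', ' '$2 != "Enabled" { n++ } END { print n + 0 }'
}

# After boot, a result of 0 means every GPU picked up the setting.
if [ -x /usr/bin/nvidia-smi ]; then
    /usr/bin/nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
        | count_not_persistent
fi
```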
