Getting “CUDA_ERROR_INVALID_VALUE: invalid argument” in python with Tensorflow 1.14

avisekvandy · April 23, 2020, 9:37pm

Some Information
python: 3.6.9
tensorflow-gpu==1.14.0
protobuf==3.11.3
tensorflow-estimator==1.14.0

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

$ nvidia-smi
Thu Apr 23 13:22:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 26%   28C    P8    12W / 250W |    119MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1277      G   /usr/lib/xorg/Xorg                            39MiB |
|    0      1388      G   /usr/bin/gnome-shell                          77MiB |
+-----------------------------------------------------------------------------+

When I run the snippet below with python test.py:

import os
# Enable '0' or disable '-1' GPU use
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
import warnings

with warnings.catch_warnings():
	warnings.filterwarnings("ignore", category=FutureWarning)
	import tensorflow as tf
	config = tf.compat.v1.ConfigProto()
	# config.gpu_options.visible_device_list = "0"  # pylint: disable=no-member
	config.gpu_options.allow_growth = True  # pylint: disable=no-member
	session = tf.compat.v1.Session(config=config)

# check if successfully using GPU
if tf.test.gpu_device_name():
	print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
	print('GPU not being used')

I get the following error

2020-04-23 13:13:15.969352: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-23 13:13:15.974088: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-23 13:13:15.990122: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_VALUE: invalid argument
2020-04-23 13:13:15.990240: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
Aborted (core dumped)

When I set os.environ['CUDA_VISIBLE_DEVICES'] = "-1" (i.e. no GPU use), there is no error and the output is as expected, shown below.

2020-04-23 13:18:24.911806: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-23 13:18:24.916849: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-23 13:18:24.920347: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-04-23 13:18:24.920384: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: vumacs
2020-04-23 13:18:24.920389: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: vumacs
2020-04-23 13:18:24.920456: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.64.0
2020-04-23 13:18:24.920482: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.64.0
2020-04-23 13:18:24.920489: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.64.0
2020-04-23 13:18:24.938734: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3299990000 Hz
2020-04-23 13:18:24.939659: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4849f40 executing computations on platform Host. Devices:
2020-04-23 13:18:24.939686: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
GPU not being used

Is there any way to resolve this error? I previously used the same code, setting CUDA_VISIBLE_DEVICES to 0 both through the script and through the shell, and there were no issues. The error seems to occur when creating the session with tf.compat.v1.Session(config=config).

Providing a few further logs

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

2020-04-23 15:32:47.855593: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-23 15:32:47.884652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:b3:00.0
2020-04-23 15:32:47.885146: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-04-23 15:32:47.886730: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-04-23 15:32:47.888298: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-04-23 15:32:47.888855: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-04-23 15:32:47.890673: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-04-23 15:32:47.892068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-04-23 15:32:47.895348: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-04-23 15:32:47.896233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
Num GPUs Available:  1

But then executing this line gives me the same error

tf.test.gpu_device_name()

2020-04-23 15:34:50.948097: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-23 15:34:50.983906: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_VALUE: invalid argument
2020-04-23 15:34:50.984119: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
Aborted (core dumped)
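
Because the failure is a fatal abort rather than a Python exception, it cannot be caught with try/except. One way to compare the two calls above side by side is to run each in its own child interpreter and inspect the return code. This is only a diagnostic sketch: the helper name run_in_child and the two probe strings are made up for illustration, and it assumes TensorFlow is importable in the same Python environment.

```python
import subprocess
import sys


def run_in_child(code, timeout=120):
    """Run a Python snippet in a separate interpreter.

    Returns (returncode, stdout, stderr). A crash in the child
    does not take down the calling process.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Hypothetical probe snippets mirroring the two calls shown in the post.
LIST_DEVICES = (
    "import tensorflow as tf; "
    "print(len(tf.config.experimental.list_physical_devices('GPU')))"
)
DEVICE_NAME = "import tensorflow as tf; print(tf.test.gpu_device_name())"


def main():
    probes = [("list_physical_devices", LIST_DEVICES),
              ("gpu_device_name", DEVICE_NAME)]
    for label, code in probes:
        rc, out, err = run_in_child(code)
        # On POSIX, a negative return code means the child was killed by a
        # signal; -6 corresponds to SIGABRT, i.e. the "Aborted" message.
        print("{}: return code {}, stdout {!r}".format(label, rc, out.strip()))
        if rc != 0:
            # Show only the tail of stderr, where the fatal log line appears.
            print("\n".join(err.strip().splitlines()[-3:]))


if __name__ == "__main__":
    main()
```

If the first probe exits with 0 and the second with a negative code, that would confirm that device enumeration works and that only context creation fails.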

SunilJB · April 24, 2020, 5:20am

Hi,
This seems to be due to an incompatible CUDA version.
Please refer to the link below for more details:
https://github.com/tensorflow/tensorflow/issues/35926#issuecomment-575258175

Thanks

avisekvandy · April 24, 2020, 12:44pm

If you look at the output of the nvcc --version command I included, I am using CUDA 10.0 with TF 1.14, which are compatible. As far as I know, the CUDA version reported for the graphics driver can differ from the CUDA toolkit version that TF runs against. Also, in the TF/CUDA logs at the end of my post, you will notice that it successfully opens libcudart.so.10.0 and the other 10.0 versions of the CUDA libraries. In fact, I had earlier used this exact installation without issues. There have been no upgrades or updates to the TF installation, graphics driver, CUDA, or cuDNN. It suddenly started showing this error.
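
The driver-side and toolkit-side versions can be queried separately. The "CUDA Version" field in nvidia-smi is the highest CUDA version the installed driver supports, while the runtime library that TF loads carries its own version. A minimal sketch using ctypes follows; the library names libcuda.so.1 and libcudart.so.10.0 are taken from the logs above, and the helper names decode_cuda_version and query_version are made up for illustration.

```python
import ctypes


def decode_cuda_version(value):
    """Convert CUDA's integer encoding (1000 * major + 10 * minor) to 'major.minor'."""
    return "{}.{}".format(value // 1000, (value % 1000) // 10)


def query_version(lib_name, func_name):
    """Return the version integer reported by a CUDA library.

    Returns None if the library or symbol cannot be loaded, or if the
    call reports a non-zero status.
    """
    try:
        lib = ctypes.CDLL(lib_name)
        func = getattr(lib, func_name)
    except (OSError, AttributeError):
        return None
    version = ctypes.c_int(0)
    status = func(ctypes.byref(version))
    return version.value if status == 0 else None


def main():
    # cuDriverGetVersion: highest CUDA version supported by the driver.
    driver = query_version("libcuda.so.1", "cuDriverGetVersion")
    # cudaRuntimeGetVersion: version of the CUDA runtime library itself.
    runtime = query_version("libcudart.so.10.0", "cudaRuntimeGetVersion")
    for label, value in [("driver", driver), ("runtime", runtime)]:
        if value is None:
            print("{}: not available".format(label))
        else:
            print("{}: CUDA {}".format(label, decode_cuda_version(value)))


if __name__ == "__main__":
    main()
```

A driver that reports a newer CUDA version than the runtime is a normal combination, since drivers are backward compatible with older runtimes.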

SunilJB · April 29, 2020, 11:52am

According to the nvidia-smi output, it seems that CUDA 10.2 is installed on your setup.
Could you please try downgrading to CUDA 10.0?

Thanks
