CUDA driver version is insufficient for CUDA runtime version (custom NGC image)

Hello, I am currently providing a development environment to users by running JupyterHub with LDAP auth on top of the NGC image (TensorFlow | NVIDIA NGC). Earlier releases worked fine, but from 23.01 (CUDA 12.0) onward, the following error occurs when a user logs in through JupyterHub inside the container and runs TensorFlow.

InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Currently, the node where the container runs (data center GPUs such as T4, V100, A100…) has GPU driver 450.80.02 and CUDA 11.0 installed. Inside the container, nvidia-smi reports the same information as the node, while nvcc --version reports CUDA 12.0.140. In other words, the driver-side and runtime-side CUDA versions differ.
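For reference, the mismatch can be checked against NVIDIA's published minimum-driver requirements. A minimal sketch (the version table below is my own transcription of NVIDIA's CUDA compatibility documentation; double-check it against the current release notes):

```python
# Minimum Linux driver version required by each CUDA runtime release,
# transcribed from NVIDIA's CUDA compatibility docs (verify before relying
# on these numbers).
MIN_DRIVER = {
    "11.0": (450, 36, 6),
    "11.x": (450, 80, 2),    # CUDA 11.1+ via minor-version compatibility
    "12.x": (525, 60, 13),
}

def driver_supports(driver: tuple, cuda_release: str) -> bool:
    """Return True if the installed driver meets the minimum for the runtime."""
    return driver >= MIN_DRIVER[cuda_release]

# Driver 450.80.02 is enough for the CUDA 11.x runtimes in older NGC images...
print(driver_supports((450, 80, 2), "11.x"))  # True
# ...but too old for the CUDA 12.0 runtime shipped from NGC 23.01 onward.
print(driver_supports((450, 80, 2), "12.x"))  # False
```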

However, I confirmed that the error does not occur and the TensorFlow code runs normally when I use the NGC image through JupyterLab directly, without going through JupyterHub, even in the same container. Something seems to change when the notebook is spawned through JupyterHub, so I was wondering whether there are environment variables I should set in the user account for CUDA compatibility, or whether any other additional settings are required.
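In case it is related: JupyterHub spawners typically launch the single-user server with a sanitized environment, which can drop the PATH / LD_LIBRARY_PATH entries the NGC image sets up (including the CUDA forward-compat libraries under /usr/local/cuda/compat). A hedged sketch of a possible jupyterhub_config.py workaround; the exact paths are assumptions based on typical NGC layouts and should be verified against your image:

```python
# jupyterhub_config.py -- sketch only; verify paths against your NGC image.
c = get_config()  # noqa: F821  (injected by JupyterHub at load time)

# Keep the CUDA-related variables the NGC image exports instead of
# letting the spawner reset them for the single-user server.
c.Spawner.env_keep = ["PATH", "LD_LIBRARY_PATH", "CUDA_HOME"]

# Or set them explicitly for every spawned notebook server.
c.Spawner.environment = {
    "LD_LIBRARY_PATH": "/usr/local/cuda/compat:/usr/local/cuda/lib64",
}
```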


Same problem here.

Same condition here.

Running nvidia-smi inside the Docker image shows:
Thu Jun 15 07:02:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …     On  | 00000000:01:00.0  On |                  N/A |
| 35%   40C    P8     7W / 260W |    629MiB / 11264MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Running nvcc --version inside the Docker image shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

[Error Message]
2023-06-15 06:49:33.281030: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "demo1.py", line 208, in
c1 = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', kernel_initializer='he_normal', padding='same')(s)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/keras/backend.py", line 2142, in truncated_normal
return tf.random.stateless_truncated_normal(
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
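The two version reports above point at the likely cause: the "CUDA Version: 11.8" in nvidia-smi is the maximum the 520.61.05 driver supports, while nvcc reports the container's 12.0 runtime. A small sketch that pulls both numbers out of pasted outputs like the ones above (pure string parsing, no GPU required):

```python
import re

def max_driver_cuda(nvidia_smi_output: str) -> tuple:
    """Extract the max 'CUDA Version' the driver supports from nvidia-smi output."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(m.group(1)), int(m.group(2)))

def runtime_cuda(nvcc_output: str) -> tuple:
    """Extract the toolkit/runtime version from `nvcc --version` output."""
    m = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2)))

smi = "| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8 |"
nvcc = "Cuda compilation tools, release 12.0, V12.0.140"

driver_max, runtime = max_driver_cuda(smi), runtime_cuda(nvcc)
print(driver_max, runtime)    # (11, 8) (12, 0)
print(runtime <= driver_max)  # False -> "driver version is insufficient"
```

If the runtime tuple exceeds the driver's maximum, upgrading the host driver (or installing the cuda-compat forward-compatibility package for data center GPUs) would be the usual remedy.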