Assistance Required for CUDA Initialization Error in TensorFlow

I am reaching out to seek assistance with an issue I am encountering while using TensorFlow with CUDA in my project. I have configured my environment to utilize CUDA for GPU acceleration, but I am facing a CUDA_ERROR_NOT_INITIALIZED error during the initialization process.

Issue Description
The specific error message I am encountering is as follows:


failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
hostname: aindra-MS-7D99
libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
kernel reported version is: 470.239.6

Despite setting the LD_LIBRARY_PATH environment variable correctly, TensorFlow seems unable to locate libcuda.so. Here is the relevant portion of my Python script:


import os
import pickle

import tensorflow as tf

# `reg` is our project's registration module, defined elsewhere.

def register(data_queue, triplet_files, horizontal_files, vertical_files, batch_size, pause_lock):
    ld_library_path = os.getenv("LD_LIBRARY_PATH")
    print("LD_LIBRARY_PATH:", ld_library_path)

    tx_graph = reg.tx_est_graph(batch_size=batch_size)

    gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.6)
    with tf.compat.v1.Session(graph=tx_graph,
                              config=tf.compat.v1.ConfigProto(gpu_options=gpu_options)) as sess:
        init = tx_graph.get_collection("init")
        sess.run(init)

        triplet_txs = reg.compute_triplet_txs(triplet_files, tx_graph, sess, pause_lock)
        horizontal_txs = reg.compute_horizontal_txs(horizontal_files, tx_graph, sess, pause_lock)
        vertical_txs = reg.compute_vertical_txs(vertical_files, tx_graph, sess, pause_lock)

        pair_translations = horizontal_txs + vertical_txs + triplet_txs
        with open('/tmp/pair_tx.pkl', 'wb') as f:
            pickle.dump(pair_translations, f)
        data_queue.put('/tmp/pair_tx.pkl')
    # sess.close() is not needed: the `with` block already closes the session.

Environment Details:
CUDA Version: 11.4
TensorFlow Version: 2.9.0
Operating System: Ubuntu 20.04
CUDA Libraries Path: /usr/local/cuda-11.4/lib64

When we started seeing the issue:
We implemented a process in C++ that we link with Python. We created a CMakeLists.txt to build the libsubprocess.so file, which the Python code can use to call the C++ functions.


cmake_minimum_required(VERSION 3.10)
project(AddLibrary)

# Enable C++17
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

find_package(OpenCV REQUIRED)

# If you use CUDA, keep the following line; otherwise it can be removed
find_package(CUDA)

include_directories(${OpenCV_INCLUDE_DIRS})

# Create a shared library from add.cpp
add_library(add SHARED add.cpp)

# Set the output directory for the shared library
set_target_properties(add PROPERTIES
    LIBRARY_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib"
    RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/bin")

target_link_libraries(add ${OpenCV_LIBS})
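For reference, this is roughly how the Python side loads the resulting shared library with ctypes. The build path and the exported `add` symbol below are placeholders based on the CMake target above, not our actual API:

```python
import ctypes

def load_native(path_or_name):
    # Return the loaded library, or None instead of raising on failure.
    try:
        return ctypes.CDLL(path_or_name)
    except OSError:
        return None

# Hypothetical usage: the path and the symbol name `add` are placeholders.
lib = load_native("build/lib/libadd.so")
if lib is not None:
    lib.add.restype = ctypes.c_int
    lib.add.argtypes = [ctypes.c_int, ctypes.c_int]
    print("add(2, 3) =", lib.add(2, 3))
else:
    print("libadd.so not found; check the build output directory")
```

Note that the C++ side must declare the function `extern "C"` for the symbol name to be visible to ctypes without C++ name mangling.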

Steps Taken

  1. Verified that LD_LIBRARY_PATH is set correctly: it is.
  2. Reverted to the version without the C++ implementation: it works normally.

Despite these efforts, the initialization error persists. I would greatly appreciate any guidance or suggestions you could provide to help resolve this issue. Specifically, any insights into correctly configuring TensorFlow to locate libcuda.so or any additional debugging steps would be highly valuable.

Thank you for your time and assistance.

Here are a few steps we can take to verify that libcuda.so is actually present and that its path can be found by the components that need it.

  1. Check CUDA Installation and Path

    - Ensure that CUDA is correctly installed and that libcuda.so is indeed in /usr/local/cuda-11.4/lib64, as you've noted. Sometimes installations do not place the library where expected.
    - libcuda.so ships with the NVIDIA driver and is typically found under /usr/lib64 or /usr/local/cuda/lib64.
    - You can locate it by running find /usr -name "libcuda.so*".
    - If there is no such file, your NVIDIA driver might need to be reinstalled.

  2. Dependencies and Linking in CMake

    - Check whether find_package(CUDA) succeeds. You might need to specify the path to CUDA manually if it is not found automatically.
    - Make sure the shared library (libadd.so) is correctly linked against the necessary CUDA libraries. This can be verified by checking the output of ldd /path/to/libadd.so.

  3. Environment Variables

    - LD_LIBRARY_PATH is an environment variable used on Linux to specify additional directories in which shared libraries are searched for. Make sure it is exported and visible in the context where the Python script runs. If you are running from an IDE or a system service (like a web server), it may not have the same environment variables as your user session.
    - To ensure it includes the CUDA library path, add the path to libcuda.so by editing your ~/.bashrc or ~/.profile file:
      export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    - After editing, run source ~/.bashrc or log out and back in to apply the changes.
    - Check that it is set correctly by running echo $LD_LIBRARY_PATH in the terminal.

  4. Check the output of nvcc --version. If you get an error or no output, CUDA might not be installed correctly.
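The checks above can be collected into one short Python script. The search paths below are common driver locations and may differ on your system:

```python
import ctypes
import glob
import os
import shutil

def cuda_diagnostics():
    """Run the libcuda / environment checks and return the results as a dict."""
    report = {}

    # 1. Is libcuda.so present in the usual driver/toolkit locations?
    patterns = [
        "/usr/lib/x86_64-linux-gnu/libcuda.so*",
        "/usr/lib64/libcuda.so*",
        "/usr/local/cuda*/lib64/libcuda.so*",
    ]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    report["libcuda_files"] = found

    # 2. Can the driver library actually be dlopen()ed by this process?
    try:
        ctypes.CDLL("libcuda.so.1")
        report["libcuda_loadable"] = True
    except OSError:
        report["libcuda_loadable"] = False

    # 3. The environment as seen by this process (not the shell).
    report["LD_LIBRARY_PATH"] = os.environ.get("LD_LIBRARY_PATH", "")

    # 4. Is nvcc on PATH?
    report["nvcc"] = shutil.which("nvcc")
    return report

if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Running this from inside the same context as your failing script (not just from your interactive shell) is important, because the process may inherit a different environment.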

Please check these and let us know if everything is in place.

Hey Anwesh,
Thank you for your time.
I have done all the steps you suggested. CUDA works perfectly fine for the first loop, but if I run the loop a second time I get this CUDA-not-initializing error. I initially thought the error could be due to TensorFlow, so I checked with PyTorch, and I am seeing the same issue.
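To clarify the structure: each iteration runs the work in its own worker process, roughly like this (heavily simplified; `worker` is a stand-in for the real registration step, which creates the session and initializes CUDA):

```python
import multiprocessing as mp

def worker(q):
    # Stand-in for register(...), which creates the TF session / CUDA context.
    q.put("done")

def run_loop(n=2):
    # "fork" is the default start method on Linux, which is what we use.
    ctx = mp.get_context("fork")
    results = []
    for _ in range(n):
        q = ctx.Queue()
        p = ctx.Process(target=worker, args=(q,))
        p.start()
        results.append(q.get())  # in the real code, iteration 2 is where cuInit fails
        p.join()
    return results

if __name__ == "__main__":
    print(run_loop())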

Hey amogh1,
Thanks for the update. Can you please upload/paste the full error message and tell us where exactly you are encountering the error?
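In the meantime, one thing worth checking: a CUDA context does not survive fork(). If the parent process has already touched CUDA (even indirectly, e.g. by importing and using TensorFlow before starting workers), cuInit in a forked child can fail on later iterations exactly like this. A minimal sketch of isolating each iteration in a freshly spawned process (the function names are placeholders, not your actual code):

```python
import multiprocessing as mp

def run_iteration(q):
    # Placeholder for the real per-iteration work. Import TensorFlow *inside*
    # this function so that only the child process ever initializes CUDA.
    q.put("ok")

def run_all(start_method="spawn", n=2):
    # "spawn" starts each child as a fresh interpreter with no inherited
    # CUDA state, unlike "fork" (the Linux default).
    ctx = mp.get_context(start_method)
    results = []
    for _ in range(n):
        q = ctx.Queue()
        p = ctx.Process(target=run_iteration, args=(q,))
        p.start()
        results.append(q.get())
        p.join()
    return results

if __name__ == "__main__":
    print(run_all("spawn"))
```

With "spawn", the target function and its arguments must be picklable and the script must be guarded by `if __name__ == "__main__":`, since each child re-imports the main module.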