Configuring multiple versions of TensorRT and TensorFlow on a shared HPC cluster; TF-TRT Warning: Cannot dlopen some TensorRT libraries

We use Bright Computing for provisioning nodes on RHEL 9 and have CUDA 11.7 and CUDA 11.8 available as modules, as well as cuDNN 8.5 for CUDA 11.7 and cuDNN 8.8 for CUDA 11.8. I also created a module for cutensor-cuda11.7.

We also have various modules for Python, e.g., mamba with Python 3.11 and Anaconda Python 3.9.10. TensorFlow 2.11.0 was installed via pip with --user.

nvidia-smi reports driver version 520.61.05 (CUDA version 11.8), and the GPUs are NVIDIA RTX A6000s.
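For reference, a CUDA 11.7 session is set up roughly like this (the module names below are guesses inferred from the library paths in the logs; check module avail for the exact names on your site):

# Hypothetical module names, inferred from the paths in the logs below
module load cuda11.7/toolkit/11.7.1
module load cudnn8.5-cuda11.7
module load cutensor-cuda11.7/1.3.1.3
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"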

What’s the reason for TF not finding the GPUs?

2023-03-30 11:54:34.772791: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-30 11:54:35.566539: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/shared/apps/cutensor-cuda11.7/1.3.1.3/lib/11:/cm/local/apps/cuda/libs/current/lib64:/cm/shared/apps/cuda11.7/toolkit/11.7.1/targets/x86_64-linux/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64

2023-03-30 11:54:35.566613: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/shared/apps/cutensor-cuda11.7/1.3.1.3/lib/11:/cm/local/apps/cuda/libs/current/lib64:/cm/shared/apps/cuda11.7/toolkit/11.7.1/targets/x86_64-linux/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64

2023-03-30 11:54:35.566627: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
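The libnvinfer warnings by themselves are harmless unless TF-TRT is actually used: the TF 2.11 wheel dlopens the TensorRT 7 sonames (libnvinfer.so.7), while TensorRT 8.x ships libnvinfer.so.8, so the lookup fails even with TensorRT 8 installed. A quick check (sketch):

# TF 2.11 asks the loader for the TensorRT 7 soname by name
python -c "import ctypes; ctypes.CDLL('libnvinfer.so.7'); print('found')" \
  || echo "libnvinfer.so.7 not resolvable; expect the TF-TRT warning"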

>>> print(tf.__version__)

2.11.0
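For what it's worth, the TF 2.11 wheels are built against CUDA 11.2 and cuDNN 8.1 and will load any compatible 11.x/8.x at runtime; the wheel's own record can be checked like this (sketch using the public tf.sysconfig API):

python -c "import tensorflow as tf; info = tf.sysconfig.get_build_info(); print(info['cuda_version'], info['cudnn_version'])"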
python mnist.py

2023-03-30 11:46:44.803644: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2023-03-30 11:46:45.605164: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/shared/apps/cutensor-cuda11.7/1.3.1.3/lib/11:/cm/local/apps/cuda/libs/current/lib64:/cm/shared/apps/cuda11.7/toolkit/11.7.1/targets/x86_64-linux/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64

2023-03-30 11:46:45.605449: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/shared/apps/cutensor-cuda11.7/1.3.1.3/lib/11:/cm/local/apps/cuda/libs/current/lib64:/cm/shared/apps/cuda11.7/toolkit/11.7.1/targets/x86_64-linux/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64

2023-03-30 11:46:45.605462: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

2023-03-30 11:46:47.410968: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/shared/apps/cutensor-cuda11.7/1.3.1.3/lib/11:/cm/local/apps/cuda/libs/current/lib64:/cm/shared/apps/cuda11.7/toolkit/11.7.1/targets/x86_64-linux/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64

2023-03-30 11:46:47.411008: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

Skipping registering GPU devices...
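The decisive failure here is the missing libcudnn.so.8: none of the directories on the LD_LIBRARY_PATH shown above contain cuDNN, so TF skips GPU registration. A quick confirmation and the likely fix (the cudnn module name is hypothetical; check module avail cudnn):

# Confirm cuDNN is not on the library path, then load the module
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cudnn || echo "no cuDNN on path"
module load cudnn8.5-cuda11.7   # hypothetical module name
python -c "import ctypes; ctypes.CDLL('libcudnn.so.8'); print('cuDNN resolvable')"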

But at least with CUDA 11.8 the GPU is found (note that the log wording below differs from the 2.11.0 run above, which suggests this environment picked up a different TensorFlow build):

python mnist.py

2023-03-30 12:04:52.263547: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

2023-03-30 12:04:53.184678: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

2023-03-30 12:04:55.144931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 46672 MB memory: -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:c1:00.0, compute capability: 8.6

Epoch 1/10

2023-03-30 12:04:56.798351: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:637] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

2023-03-30 12:04:56.967906: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x1547d7dfd390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

2023-03-30 12:04:56.967960: I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (0): NVIDIA RTX A6000, Compute Capability 8.6

2023-03-30 12:04:56.971514: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.

2023-03-30 12:04:57.084710: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8801

2023-03-30 12:04:57.094134: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.

Searched for CUDA in the following directories:

./cuda_sdk_lib

/usr/local/cuda-11.8

/usr/local/cuda

.

You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

2023-03-30 12:04:57.094329: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc

2023-03-30 12:04:57.094586: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc

2023-03-30 12:04:57.094610: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: libdevice not found at ./libdevice.10.bc

[[{{node StatefulPartitionedCall_2}}]]

2023-03-30 12:04:57.111179: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc

2023-03-30 12:04:57.111354: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc

2023-03-30 12:04:57.155357: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc

2023-03-30 12:04:57.155587: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc

2023-03-30 12:04:57.171499: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc

2023-03-30 12:04:57.171675: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
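For the libdevice failures, the warning above spells out the fix: point XLA at the CUDA toolkit root, which contains nvvm/libdevice. On a module-based cluster that root is not /usr/local/cuda; a sketch, with the 11.8 toolkit path assumed by analogy with the 11.7 path in the earlier logs:

# Assumed toolkit root (the 11.7 one in the logs is
# /cm/shared/apps/cuda11.7/toolkit/11.7.1; adjust for your 11.8 module)
CUDA_ROOT=/cm/shared/apps/cuda11.8/toolkit/11.8.0
ls "$CUDA_ROOT/nvvm/libdevice/libdevice.10.bc"   # the file XLA is looking for
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_ROOT"
python mnist.py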

Hi,
Please check the links below, as they might answer your concerns.

Thanks!

Nothing there about multiple versions. Any other specific suggestions?

Hi @rk3199 ,

We are checking on this. Will update you on the same.

Thanks

Hi @rk3199 ,
Did you try completely removing CUDA and reinstalling it?

Thanks

No, as this is on a cluster and CUDA is loaded as a module.

Lower versions of Python, e.g., 3.7, do not generate this error/warning.
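One thing worth checking before blaming the Python version itself: each interpreter has its own site-packages, so 3.7 and 3.9 may simply be importing different TensorFlow builds. A sketch (interpreter names hypothetical):

# Compare what each interpreter actually imports
python3.7 -c "import tensorflow as tf; print(tf.__version__, tf.__file__)"
python3.9 -c "import tensorflow as tf; print(tf.__version__, tf.__file__)"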

Hi @rk3199 ,
Can you please share the TensorRT version you are using?
Also, could you try an upgrade and let us know if the issue is still there?

Thanks

pip list | grep -i tensorrt
WARNING: Ignoring invalid distribution -ensorflow (/path/to/me/.local/lib/python3.9/site-packages)
nvidia-tensorrt 8.4.3.1
tensorrt 8.6.1
tensorrt-bindings 8.6.1
tensorrt-dispatch 8.6.0
tensorrt-lean 8.6.0
tensorrt-libs 8.6.1
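(As an aside, the "Ignoring invalid distribution -ensorflow" warning usually means an interrupted pip operation left behind a stray directory whose name starts with "~"; removing it silences the warning. A sketch, assuming the site-packages path from the warning:)

# pip renames a package dir to ~name while (un)installing; a leftover one
# triggers the "invalid distribution" warning and is safe to delete
ls -d ~/.local/lib/python3.9/site-packages/~* 2>/dev/null
rm -rf ~/.local/lib/python3.9/site-packages/~ensorflow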

Upgrade what? We use modules, so I can install specific versions.


Hi, I have the same problem using Python 3.9.4 with CUDA 11.5 and the matching cuDNN on an HPC cluster. Any solution so far?
