XLA:gpu system doesn't work.

leekyoj · February 27, 2020, 10:46am

My environment:

CentOS 7.4
Python 3.6.10
tensorflow==1.15.0

NVIDIA driver 396.37 (Tesla P100)
CUDA 9.2
cuDNN 9.2-linux-x64-v7.6.5.32

Problem:

I use a dual-GPU system whose devices are named "/device:XLA_CPU:0" and "/device:XLA_CPU:1".
But when I checked nvidia-smi and the training results, they differed from those on a Windows PC with the same environment.
So I am trying to run on just one GPU.
When I execute the code under with K.tfdevice('/device:XLA_GPU:1'): the following error occurs:

2020-02-28 21:47:15.586539: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 21:47:16.538678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:3b:00.0
2020-02-28 21:47:16.539295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:86:00.0
2020-02-28 21:47:16.539406: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcudart.so.10.0’; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64::/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-02-28 21:47:16.539470: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcublas.so.10.0’; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64::/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-02-28 21:47:16.539518: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcufft.so.10.0’; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64::/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-02-28 21:47:16.539565: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcurand.so.10.0’; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64::/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-02-28 21:47:16.539612: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcusolver.so.10.0’; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64::/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-02-28 21:47:16.539658: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcusparse.so.10.0’; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64::/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib:/opt/python/lib:/APP/enhpc/mpi/openmpi-gcc/lib:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-02-28 21:47:16.542597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 21:47:16.542621: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at Install TensorFlow with pip for how to download and setup the required libraries for your platform.
Skipping registering GPU devices…
2020-02-28 21:47:16.543187: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-02-28 21:47:16.553819: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-02-28 21:47:16.555020: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xcc9c680 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 21:47:16.555042: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-28 21:47:16.832426: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xccff6c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 21:47:16.832490: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2020-02-28 21:47:16.832506: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla P100-PCIE-16GB, Compute Capability 6.0
2020-02-28 21:47:16.832875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-28 21:47:16.832889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
2020-02-28 21:47:16.913283: I tensorflow/compiler/jit/xla_compilation_cache.cc:238] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2020-02-28 21:47:16.914574: F tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: CUDA driver version is insufficient for CUDA runtime version
2020-02-28 21:47:16.914588: F tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: CUDA driver version is insufficient for CUDA runtime version
2020-02-28 21:47:16.914574: F tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: CUDA driver version is insufficient for CUDA runtime version
Aborted (core dumped)

Attempted solutions:

I reinstalled CUDA and the driver.
I added the following environment variables to ~/.bashrc:

export PATH=/usr/local/cuda-9.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH
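One way to see whether these exports can satisfy the loader is to look for the exact file name from the warnings (for example libcudart.so.10.0) in each LD_LIBRARY_PATH directory. The following is only a minimal sketch: `find_library` is a hypothetical helper name, and it checks LD_LIBRARY_PATH alone, whereas the real dynamic loader also consults the system cache and default directories.

```python
import os


def find_library(lib_name, search_path):
    """Return the directories in a colon-separated search path that contain lib_name."""
    hits = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # empty entries come from '::' in the path
        candidate = os.path.join(directory, lib_name)
        if os.path.isfile(candidate) and directory not in hits:
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Library name taken from the "Could not load dynamic library" warnings.
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(find_library("libcudart.so.10.0", path))
```

An empty list means none of the listed directories holds that file, which would be consistent with the "cannot open shared object file" warnings in the log.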

I am not sure how to use XLA on the GPU with TensorFlow. What should I do?

nluehr · February 29, 2020, 12:31am

TensorFlow 1.15.0 is built against CUDA 10.0 (your log shows it looking for libcudart.so.10.0), which requires a 410.48 or newer NVIDIA driver, along with cuDNN 7.4 or newer.
See Install TensorFlow with pip for details.
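The driver requirement can also be checked mechanically. This is a minimal sketch, assuming version strings are dotted integers such as 396.37. The helper names `parse_version`, `driver_is_sufficient` and `installed_driver_version` are hypothetical, and the last one assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_version(text):
    """Turn a dotted version string such as '396.37' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(installed, required):
    """Return True if the installed driver version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def installed_driver_version():
    """Ask nvidia-smi for the driver version (the same for every GPU on the host)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6-compatible spelling of text=True
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

With the 396.37 driver from the question, the check fails for any required minimum of 397 or higher, which is consistent with the "CUDA driver version is insufficient for CUDA runtime version" abort in the log.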
