Solved: Tesla T4 GPUs, TensorFlow 2.1 RuntimeError "unknown device" on Cisco UCS C240 M5 system

What could be causing the Tesla T4 GPUs not to be listed with device_type='GPU' in the output of:
tf.config.list_physical_devices()?

This results in an "unknown device" RuntimeError when TensorFlow tries to access the GPUs:

with tf.device('/gpu:3'):
    a = tf.constant(3.0)
# Output
...
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:3 unknown device.

What more can I do to troubleshoot this further?
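One quick first check, as a sketch: TensorFlow's C++ runtime logs a warning for every CUDA library it fails to load, so running the device lookup with log suppression turned off will usually name the missing file directly (TF_CPP_MIN_LOG_LEVEL has to be set before the import):

    import os
    # 0 = show all C++ log messages (must be set before importing TensorFlow)
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

    import tensorflow as tf
    # On a broken install this typically prints warnings such as
    # "Could not load dynamic library 'libcudart.so.10.1'" to stderr.
    print(tf.config.list_physical_devices('GPU'))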

Environment where this issue occurred:

  • Cisco UCS C240 M5 system with:
    ** 72x CPUs and
    ** 5x Tesla T4 GPUs
  • Ubuntu 18.04 LTS
  • NVIDIA-SMI 418.126.02, Driver Version: 418.126.02, CUDA Version: 10.1
    ** All 5 T4s are recognized by nvidia-smi
  • TensorFlow 2.1.0
  • libcudnn7_7.6.5.32-1+cuda10.1_amd64

Querying devices from Python:

    import tensorflow as tf
    tf.config.list_physical_devices()
# Output
    [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
    PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:2', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:3', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:4', device_type='XLA_GPU')]

    tf.config.list_physical_devices('GPU')
# Output
    []
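It is also worth ruling out a CPU-only wheel; tf.test.is_built_with_cuda() reports whether the installed binary was compiled with CUDA support:

    import tensorflow as tf
    # True means the wheel itself has CUDA support, so the problem is the
    # CUDA libraries on the system rather than the TensorFlow build.
    print(tf.test.is_built_with_cuda())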

    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
# Output
    [name: "/device:CPU:0"
    device_type: "CPU"
    memory_limit: 268435456
    locality {
    }
    incarnation: 13771363116588327167
    , name: "/device:XLA_CPU:0"
    device_type: "XLA_CPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 17218183889029182531
    physical_device_desc: "device: XLA_CPU device"
    , name: "/device:XLA_GPU:0"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 9157258922701839704
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:1"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 4158970543181084654
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:2"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 15403740508526850072
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:3"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 9287480476894551351
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:4"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 15439875423529567742
    physical_device_desc: "device: XLA_GPU device"
    ]

This issue was caused by missing CUDA runtime library files. The NVIDIA driver itself was fine (nvidia-smi saw all five cards, and the XLA_GPU entries appeared), but TensorFlow 2.1 could not load the CUDA 10.1 runtime libraries it is built against, so no device_type='GPU' devices were registered.
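A minimal sketch to confirm this from Python, assuming the standard library SONAMEs of the CUDA 10.1 / cuDNN 7 stack that TensorFlow 2.1 loads — try to dlopen each one:

    import ctypes

    # Library SONAMEs for a CUDA 10.1 / cuDNN 7 stack (assumed list).
    for lib in ("libcuda.so.1", "libcudart.so.10.1", "libcublas.so.10",
                "libcufft.so.10", "libcurand.so.10", "libcusolver.so.10",
                "libcusparse.so.10", "libcudnn.so.7"):
        try:
            ctypes.CDLL(lib)
            print("OK      ", lib)
        except OSError:
            print("MISSING ", lib)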

The solution was to remove the Ubuntu-packaged toolkit and add NVIDIA's CUDA repositories:

sudo apt remove --autoremove nvidia-cuda-toolkit
sudo apt update
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'

Then install the CUDA 10.1 packages:

sudo apt update
sudo apt install cuda-10-1
sudo apt install libcudnn7

Reboot the system.
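After the reboot, you can confirm the dynamic loader resolves the CUDA libraries before starting TensorFlow, for example by filtering the ldconfig cache:

    import subprocess

    # Ask the dynamic loader for its cached list of shared libraries
    # and keep only the CUDA-related entries.
    cache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
    for line in cache.splitlines():
        if any(name in line for name in ("libcudart", "libcublas", "libcudnn")):
            print(line.strip())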

The Tesla T4 GPUs now show up with device_type='GPU'!

    import tensorflow as tf
    tf.config.list_physical_devices()
# Output
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:2', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:3', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:4', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU')]
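
A side note for a 5-GPU box like this one: by default TensorFlow reserves almost all memory on every visible GPU at startup. If that is not what you want, the experimental config API in 2.1 can switch to on-demand allocation:

    import tensorflow as tf

    # Must run before any op initializes the GPUs.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)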

Tensors can now be placed on the T4 GPUs.

with tf.device('/device:GPU:3'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
# Output
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
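
To double-check where ops actually run, TensorFlow can also log device placement; a small sketch using tf.debugging.set_log_device_placement (enable it before creating any ops):

    import tensorflow as tf

    # Log the device assigned to each op as it is created.
    tf.debugging.set_log_device_placement(True)

    with tf.device('/device:GPU:3'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    print(a.device)  # e.g. /job:localhost/replica:0/task:0/device:GPU:3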

Hope this helps someone else.

Dave