Solved: Tesla T4 GPUs, TensorFlow 2.1 RuntimeError "unknown device" on Cisco UCS C240 M5 system

What could be causing the Tesla T4 GPUs not to be listed with device_type='GPU' in the output of:
tf.config.list_physical_devices()?

This results in an "unknown device" RuntimeError when TensorFlow tries to access the GPUs:

with tf.device('/gpu:3'):
    a = tf.constant(3.0)
# Output
...
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:3 unknown device.

What more can I do to troubleshoot this further?
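One quick first check, as a sketch: TensorFlow's C++ runtime logs a warning for every CUDA library it fails to load, so running the device lookup with log suppression turned off will usually name the missing file directly (TF_CPP_MIN_LOG_LEVEL has to be set before the import):

    import os
    # 0 = show all C++ log messages (must be set before importing TensorFlow)
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

    import tensorflow as tf
    # On a broken install this typically prints warnings such as
    # "Could not load dynamic library 'libcudart.so.10.1'" to stderr.
    print(tf.config.list_physical_devices('GPU'))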

Environment where this issue occurred:

  • Cisco UCS C240 M5 system with:
    ** 72x CPUs and
    ** 5x Tesla T4 GPUs
  • Ubuntu 18.04 LTS
  • NVIDIA-SMI 418.126.02, Driver Version: 418.126.02, CUDA Version: 10.1
    ** All 5 T4s are recognized by nvidia-smi
  • TensorFlow 2.1.0
  • libcudnn7_7.6.5.32-1+cuda10.1_amd64

Querying devices from Python:

    import tensorflow as tf
    tf.config.list_physical_devices()
# Output
    [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
    PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:2', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:3', device_type='XLA_GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:4', device_type='XLA_GPU')]

    tf.config.list_physical_devices('GPU')
# Output
    []
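It is also worth ruling out a CPU-only wheel; tf.test.is_built_with_cuda() reports whether the installed binary was compiled with CUDA support:

    import tensorflow as tf
    # True means the wheel itself has CUDA support, so the problem is the
    # CUDA libraries on the system rather than the TensorFlow build.
    print(tf.test.is_built_with_cuda())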

    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
# Output
    [name: "/device:CPU:0"
    device_type: "CPU"
    memory_limit: 268435456
    locality {
    }
    incarnation: 13771363116588327167
    , name: "/device:XLA_CPU:0"
    device_type: "XLA_CPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 17218183889029182531
    physical_device_desc: "device: XLA_CPU device"
    , name: "/device:XLA_GPU:0"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 9157258922701839704
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:1"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 4158970543181084654
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:2"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 15403740508526850072
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:3"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 9287480476894551351
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_GPU:4"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 15439875423529567742
    physical_device_desc: "device: XLA_GPU device"
    ]

This issue was caused by missing CUDA runtime library files. The NVIDIA driver itself was fine (nvidia-smi saw all five cards, and the XLA_GPU entries appeared), but TensorFlow 2.1 could not load the CUDA 10.1 runtime libraries it is built against, so no device_type='GPU' devices were registered.
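A minimal sketch to confirm this from Python, assuming the standard library SONAMEs of the CUDA 10.1 / cuDNN 7 stack that TensorFlow 2.1 loads — try to dlopen each one:

    import ctypes

    # Library SONAMEs for a CUDA 10.1 / cuDNN 7 stack (assumed list).
    for lib in ("libcuda.so.1", "libcudart.so.10.1", "libcublas.so.10",
                "libcufft.so.10", "libcurand.so.10", "libcusolver.so.10",
                "libcusparse.so.10", "libcudnn.so.7"):
        try:
            ctypes.CDLL(lib)
            print("OK      ", lib)
        except OSError:
            print("MISSING ", lib)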

The solution was to remove the Ubuntu-packaged toolkit and add NVIDIA's CUDA repositories:

sudo apt remove --autoremove nvidia-cuda-toolkit
sudo apt update
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'

Then install the CUDA 10.1 packages:

sudo apt update
sudo apt install cuda-10-1
sudo apt install libcudnn7

Reboot the system.
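After the reboot, you can confirm the dynamic loader resolves the CUDA libraries before starting TensorFlow, for example by filtering the ldconfig cache:

    import subprocess

    # Ask the dynamic loader for its cached list of shared libraries
    # and keep only the CUDA-related entries.
    cache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
    for line in cache.splitlines():
        if any(name in line for name in ("libcudart", "libcublas", "libcudnn")):
            print(line.strip())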

The Tesla T4 GPUs now show up with device_type='GPU'!

    import tensorflow as tf
    tf.config.list_physical_devices()
# Output
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:2', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:3', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:4', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU')]
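
A side note for a 5-GPU box like this one: by default TensorFlow reserves almost all memory on every visible GPU at startup. If that is not what you want, the experimental config API in 2.1 can switch to on-demand allocation:

    import tensorflow as tf

    # Must run before any op initializes the GPUs.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)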

Tensors can now be placed on the T4 GPUs.

with tf.device('/device:GPU:3'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
# Output
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
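
To double-check where ops actually run, TensorFlow can also log device placement; a small sketch using tf.debugging.set_log_device_placement (enable it before creating any ops):

    import tensorflow as tf

    # Log the device assigned to each op as it is created.
    tf.debugging.set_log_device_placement(True)

    with tf.device('/device:GPU:3'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    print(a.device)  # e.g. /job:localhost/replica:0/task:0/device:GPU:3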

Hope this helps someone else.

Dave