Is my TensorFlow install really using the GPU?

Greetings,

On my Jetson Nano board (4GB RAM, 8GB swap file), I installed TensorFlow (version 2.1.0+nv20.3.tf2) on top of JetPack 4.3 and verified that the GPU is detected:

from tensorflow.python.client import device_lib

def get_available_gpus():
    # List all local devices and keep only those of type GPU
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

get_available_gpus()

My output was:

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
tensorflow/compiler/xla/service/service.cc:168] XLA service 0x29536a90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3 coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.86GiB deviceMemoryBandwidth: 23.84GiB/s
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/device:GPU:0 with 270 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
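
The log confirms the device is visible. For reference, the same check can be expressed with the public TF 2.x API; this is a minimal sketch of my own, assuming TensorFlow 2.1 where tf.config.list_physical_devices and tf.debugging.set_log_device_placement are available:

import tensorflow as tf

# List the physical GPUs TensorFlow can see; on the Nano this should
# report the integrated Tegra X1 as a single GPU device.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))

# Optionally log which device each op is placed on when the model runs.
tf.debugging.set_log_device_placement(True)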

Now, when testing MNIST under Jupyter Notebook:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
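
A quick way to tell whether the GPU is actually carrying the training load is to run the same fit pinned to the CPU and compare epoch times. A minimal sketch of my own (soft placement may still move a few ops, so treat it as a rough comparison):

import time

# Build and train an identical model with its variables placed on the CPU.
with tf.device('/CPU:0'):
    cpu_model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')])
    cpu_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
    start = time.time()
    cpu_model.fit(x_train, y_train, epochs=1)
    print("CPU-only epoch time: %.1f s" % (time.time() - start))

If the default run is not clearly faster than this CPU-pinned one, the GPU is either not being used or is not the bottleneck for such a small dense model.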

The training performance seemed quite poor to me:

Train on 60000 samples

Epoch 1/5

60000/60000 [==============================] - 35s 586us/sample - loss: 0.3002 - accuracy: 0.9122

Epoch 2/5

60000/60000 [==============================] - 26s 437us/sample - loss: 0.1457 - accuracy: 0.9567

Epoch 3/5

60000/60000 [==============================] - 26s 425us/sample - loss: 0.1053 - accuracy: 0.9676

Epoch 4/5

60000/60000 [==============================] - 26s 434us/sample - loss: 0.0866 - accuracy: 0.9734

Epoch 5/5

60000/60000 [==============================] - 26s 429us/sample - loss: 0.0724 - accuracy: 0.9773

On my laptop (without a GPU) I got much better performance:

Train on 60000 samples

Epoch 1/5

60000/60000 [==============================] - 3s 57us/sample - loss: 0.2959 - accuracy: 0.9119

Epoch 2/5

60000/60000 [==============================] - 3s 56us/sample - loss: 0.1453 - accuracy: 0.9564

Epoch 3/5

60000/60000 [==============================] - 4s 63us/sample - loss: 0.1083 - accuracy: 0.9671

Epoch 4/5

60000/60000 [==============================] - 3s 54us/sample - loss: 0.0873 - accuracy: 0.9736

Epoch 5/5

60000/60000 [==============================] - 3s 48us/sample - loss: 0.0757 - accuracy: 0.9760

Surely I don't have the right settings. What performance could I expect on the Jetson Nano with an optimal configuration?

Thank you in advance for your advice
JP

To check whether Keras is using the GPU, according to this you can try:

from keras import backend as K
# Lists the GPUs visible to the standalone Keras TensorFlow backend
K.tensorflow_backend._get_available_gpus()

The Jetson Nano is designed as an edge computing device for GPU-assisted inference. Poor training performance compared to a current desktop or laptop CPU is not surprising.
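
As a side note (my own addition, not part of the original answer): with such a small fully connected model and the default batch size of 32, per-batch overhead tends to dominate on the Nano's integrated GPU. Increasing the batch size usually improves GPU utilization; a minimal sketch, where 512 is only an illustrative value:

# Larger batches amortize kernel-launch and host/device overhead per step.
model.fit(x_train, y_train, epochs=5, batch_size=512)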