Jetson Xavier NX - Tensorflow 2 container slower on GPU than on CPU

Hi everyone, this week I received my Jetson Xavier NX developer board and started playing a bit with it.

I found out that NVIDIA provides a Docker image based on L4T with TensorFlow 1 installed. I used its Dockerfile to create a similar container with TensorFlow 2. The new Dockerfile is here, and the image is on Docker Hub with the tag carlosedp/l4t-tensorflow:r32.4.2-tf2-py3.

While testing it with the TensorFlow “hello world” sample below (from the TensorFlow site), I found two things:

  1. Running the sample with the GPU (nvidia runtime enabled in Docker) takes about as long as, or longer than, running it without the GPU (runtime disabled).
  2. According to the jtop utility, GPU utilization never goes above 35% and the GPU frequency stays at the minimum of 114MHz.

I wonder why the GPU times are so close to the CPU times and why the run doesn’t use the GPU to its full capacity. I’m a beginner at ML on GPUs, so I might be screwing something up.

Below are the commands used, the logs, screenshots and the sample.

Sample:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

# Load MNIST and scale pixel values from 0-255 to 0.0-1.0
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small fully connected classifier: 784 inputs -> 128 hidden units -> 10 classes
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train for 5 epochs, then measure loss and accuracy on the test set
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
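
For a sense of scale, the trainable parameter count of this sample can be worked out by hand. The sketch below is plain Python and does not need TensorFlow; `dense_params` is a helper name made up for illustration, not a Keras API. Flatten and Dropout contribute no weights.

```python
# Rough size of the sample model, computed by hand (no TensorFlow needed).
# dense_params is an illustrative helper, not part of Keras.

def dense_params(inputs: int, units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units

flatten_out = 28 * 28                      # Flatten: 784 values, no weights
hidden = dense_params(flatten_out, 128)    # Dense(128)
output = dense_params(128, 10)             # Dense(10); Dropout has no weights
total_params = hidden + output

print(f"hidden={hidden} output={output} total={total_params}")
```

With only about 100k parameters, each training step involves very little arithmetic, which may be part of why the GPU has little room to pull ahead on this particular sample.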

Logs with GPU execution:

# Device Query
❯ docker run -it --runtime=nvidia --rm jitteam/devicequery ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Xavier"
  CUDA Driver Version / Runtime Version          10.2 / 10.0
  CUDA Capability Major/Minor version number:    7.2
  Total amount of global memory:                 7764 MBytes (8140709888 bytes)
  ( 6) Multiprocessors, ( 64) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1109 MHz (1.11 GHz)
  Memory Clock rate:                             1109 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

# Sample run:
❯ docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash -c "time python3 /work/hello-tf.py"
2020-06-05 19:06:35.092322: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:37.678757: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-06-05 19:06:37.681616: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-06-05 19:06:42.888932: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-06-05 19:06:42.896688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:42.896971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7.58GiB deviceMemoryBandwidth: 66.10GiB/s
2020-06-05 19:06:42.897063: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:42.897217: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-05 19:06:42.901119: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-06-05 19:06:42.902278: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-06-05 19:06:42.906921: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-06-05 19:06:42.910432: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-06-05 19:06:42.910607: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-05 19:06:42.910999: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:42.911347: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:42.911476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-05 19:06:42.934951: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-06-05 19:06:42.935906: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x23fc9fa0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-05 19:06:42.935995: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-05 19:06:43.028912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.029441: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x231cc580 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-05 19:06:43.029535: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Xavier, Compute Capability 7.2
2020-06-05 19:06:43.030013: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.030160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7.58GiB deviceMemoryBandwidth: 66.10GiB/s
2020-06-05 19:06:43.030253: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:43.030307: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-05 19:06:43.030398: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-06-05 19:06:43.030468: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-06-05 19:06:43.030536: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-06-05 19:06:43.030602: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-06-05 19:06:43.030651: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-05 19:06:43.030930: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.031242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.031499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-05 19:06:43.031723: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:46.944439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-05 19:06:46.944569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-06-05 19:06:46.944615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-06-05 19:06:46.945106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:46.945435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:46.945629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 125 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Epoch 1/5
2020-06-05 19:06:47.540354: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
60000/60000 [==============================] - 15s 256us/sample - loss: 0.2929 - acc: 0.9150
Epoch 2/5
60000/60000 [==============================] - 14s 228us/sample - loss: 0.1402 - acc: 0.9587
Epoch 3/5
60000/60000 [==============================] - 14s 229us/sample - loss: 0.1064 - acc: 0.9676
Epoch 4/5
60000/60000 [==============================] - 14s 227us/sample - loss: 0.0877 - acc: 0.9729
Epoch 5/5
60000/60000 [==============================] - 14s 228us/sample - loss: 0.0750 - acc: 0.9770
10000/10000 [==============================] - 1s 149us/sample - loss: 0.0733 - acc: 0.9759

real	1m25.732s
user	1m27.420s
sys	0m11.308s

Logs with CPU execution (without GPU):

# Query device:
❯ docker run -it --rm jitteam/devicequery ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

# Sample run:
❯ docker run -it --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash -c "time python3 /work/hello-tf.py"
2020-06-05 19:04:43.295703: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:43.295831: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-06-05 19:04:45.829129: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer.so.7'; dlerror: /usr/lib/aarch64-linux-gnu/libnvinfer.so.7: file too short; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:45.829571: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.7: file too short; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:45.829702: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-06-05 19:04:51.494518: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/aarch64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:51.494658: E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: UNKNOWN ERROR (303)
2020-06-05 19:04:51.494824: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (aab2062db9e6): /proc/driver/nvidia/version does not exist
2020-06-05 19:04:51.517571: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-06-05 19:04:51.519296: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x32ce5410 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-05 19:04:51.519433: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Epoch 1/5
60000/60000 [==============================] - 11s 187us/sample - loss: 0.2937 - acc: 0.9140
Epoch 2/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.1446 - acc: 0.9569
Epoch 3/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.1077 - acc: 0.9675
Epoch 4/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.0906 - acc: 0.9719
Epoch 5/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.0771 - acc: 0.9764
10000/10000 [==============================] - 1s 102us/sample - loss: 0.0793 - acc: 0.9773

real	1m6.123s
user	1m27.540s
sys	0m8.412s

P.S. I saw similar execution times when running the same sample with TensorFlow 1 in the container provided by NVIDIA (nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3).
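
To compare the two runs per training step rather than per sample, the per-sample figures from the logs can be scaled by the batch size. This is a rough sketch that assumes `model.fit()` used the Keras default `batch_size` of 32, since the sample does not set one; the variable names are illustrative.

```python
# Convert the per-sample times reported in the logs into per-batch times.
# Assumption: model.fit() ran with the Keras default batch_size of 32,
# because the sample does not pass batch_size explicitly.
import math

BATCH_SIZE = 32
TRAIN_SAMPLES = 60000

# Steady-state per-sample times (microseconds) from the epoch lines in the logs.
us_per_sample = {"gpu": 228, "cpu": 182}

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
ms_per_batch = {dev: us * BATCH_SIZE / 1000 for dev, us in us_per_sample.items()}

print(f"steps per epoch: {steps_per_epoch}")
for dev, ms in ms_per_batch.items():
    print(f"{dev}: {ms:.2f} ms per batch")
```

Under that assumption each epoch is 1875 small steps, each taking a few milliseconds on either device, so fixed per-step overhead could plausibly account for a large share of the measured time.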


I have no experience running Keras models on TF, but looking at this article on Stack Overflow: are you sure you have enabled GPU support for Keras correctly?
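
The GPU run in the logs already reports a created GPU:0 device, but a quick programmatic check from inside the container is to list the physical devices TensorFlow can see. This is a minimal sketch: `visible_gpus` is an illustrative helper, and it assumes a TensorFlow release that provides either `tf.config.list_physical_devices` or the older `tf.config.experimental` variant. It returns None when TensorFlow is not installed.

```python
# List the GPUs TensorFlow can see. visible_gpus is an illustrative helper.
# Assumption: the installed TensorFlow exposes list_physical_devices either
# directly under tf.config or under tf.config.experimental (older releases).

def visible_gpus():
    """Return the names of visible GPU devices, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices
    return [device.name for device in lister("GPU")]

gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("TensorFlow is installed but sees no GPU")
```

Running it once with `--runtime=nvidia` and once without should show whether the difference between the two containers is what TensorFlow actually sees.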

I think this is a TensorFlow vs. TensorRT thing. TF is optimized for high-precision floating-point numbers, whereas TensorRT is optimized for INT. The MNIST dataset is pixel data, I believe, so convolutions etc. can work on 0-255 data values. Or something.

Hi,

Thanks for reporting this to us.
Here are a couple of experiments we need your help with first.

1. Please maximize the Xavier NX performance outside of the container first.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Could you also run the same performance test on the original TensorFlow 1 container?

Thanks.
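
To confirm that the settings from step 1 took effect before re-running the test, the current power mode and clock state can be queried with `nvpmodel -q` and `jetson_clocks --show`. The sketch below wraps both in Python; `query_tool` is a hypothetical helper, and it assumes it runs on the Jetson host (not in the container), that both tools are on the PATH, and that sudo works without a password prompt.

```python
# Query the Jetson power mode and clock settings from Python.
# Assumptions: run on the Jetson host with nvpmodel and jetson_clocks on PATH
# and passwordless sudo; on other machines the helper simply returns None.
import shutil
import subprocess
from typing import List, Optional

def query_tool(tool: str, args: List[str]) -> Optional[str]:
    """Run `sudo -n tool args` and return its output, or None if unavailable."""
    if shutil.which(tool) is None:
        return None
    try:
        result = subprocess.run(
            ["sudo", "-n", tool] + args,      # -n: fail instead of prompting
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None

for tool, args in (("nvpmodel", ["-q"]), ("jetson_clocks", ["--show"])):
    output = query_tool(tool, args)
    print(f"== {tool} {' '.join(args)} ==")
    print(output if output is not None else "not available on this machine")
```

Comparing the reported GPU frequency before and after `sudo jetson_clocks` is one way to tell whether the 114MHz reading seen in jtop still applies.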

Thanks for the answers… I’ll try @AastaLLL’s tips and also a different workload to check whether it’s related to that specific MNIST test or it’s a general thing.
I’ll let you know soon!