Hi everyone, this week I received my Jetson Xavier NX developer board and started playing a bit with it.
I found out that NVidia provides a Docker image based on L4T with Tensorflow 1 installed. I used its Dockerfile and created a similar container with Tensorflow 2. The new Dockerfile is here and the image is on Dockerhub with the tag carlosedp/l4t-tensorflow:r32.4.2-tf2-py3.
While testing it with the Tensorflow “hello world” sample below, taken from the Tensorflow site, I found two things:
- Running the sample with and without the GPU (I enabled or disabled the nvidia runtime on Docker) takes pretty much the same time, or is even slower on the GPU.
- According to the jtop utility, GPU usage never goes above 35% and the GPU frequency stays at the minimum of 114MHz.
I wonder why the GPU times are almost the same as the CPU times and why the runtime doesn’t use the GPU to its full capacity. I’m a beginner on ML with GPUs so I might be screwing something up.
Below are the commands used, logs, screenshots and the sample.
Sample:
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
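The runs in this post time the whole script with the shell's time, so the totals also include the Tensorflow import, the MNIST download and the CUDA library loading. To time only the training and evaluation, something like the sketch below could be used. It uses only the Python standard library; the names timed and timings are made up for illustration and are not part of the original sample.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in the sample this would be model.fit(...) or model.evaluate(...).
with timed("example", timings):
    sum(range(100000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

In the sample, the two blocks would be `with timed("fit", timings): model.fit(x_train, y_train, epochs=5)` and `with timed("evaluate", timings): model.evaluate(x_test, y_test)`, which keeps startup cost out of the CPU versus GPU comparison.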
Logs with GPU execution:
# Device Query
❯ docker run -it --runtime=nvidia --rm jitteam/devicequery ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Xavier"
CUDA Driver Version / Runtime Version 10.2 / 10.0
CUDA Capability Major/Minor version number: 7.2
Total amount of global memory: 7764 MBytes (8140709888 bytes)
( 6) Multiprocessors, ( 64) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1109 MHz (1.11 GHz)
Memory Clock rate: 1109 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
# Sample run:
❯ docker run -it --runtime=nvidia --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash -c "time python3 /work/hello-tf.py"
2020-06-05 19:06:35.092322: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:37.678757: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-06-05 19:06:37.681616: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-06-05 19:06:42.888932: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-06-05 19:06:42.896688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:42.896971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7.58GiB deviceMemoryBandwidth: 66.10GiB/s
2020-06-05 19:06:42.897063: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:42.897217: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-05 19:06:42.901119: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-06-05 19:06:42.902278: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-06-05 19:06:42.906921: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-06-05 19:06:42.910432: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-06-05 19:06:42.910607: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-05 19:06:42.910999: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:42.911347: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:42.911476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-05 19:06:42.934951: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-06-05 19:06:42.935906: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x23fc9fa0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-05 19:06:42.935995: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-05 19:06:43.028912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.029441: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x231cc580 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-05 19:06:43.029535: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Xavier, Compute Capability 7.2
2020-06-05 19:06:43.030013: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.030160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.109GHz coreCount: 6 deviceMemorySize: 7.58GiB deviceMemoryBandwidth: 66.10GiB/s
2020-06-05 19:06:43.030253: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:43.030307: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-06-05 19:06:43.030398: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-06-05 19:06:43.030468: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-06-05 19:06:43.030536: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-06-05 19:06:43.030602: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-06-05 19:06:43.030651: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-06-05 19:06:43.030930: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.031242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:43.031499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-05 19:06:43.031723: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-06-05 19:06:46.944439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-05 19:06:46.944569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-06-05 19:06:46.944615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-06-05 19:06:46.945106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:46.945435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-06-05 19:06:46.945629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 125 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Epoch 1/5
2020-06-05 19:06:47.540354: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
60000/60000 [==============================] - 15s 256us/sample - loss: 0.2929 - acc: 0.9150
Epoch 2/5
60000/60000 [==============================] - 14s 228us/sample - loss: 0.1402 - acc: 0.9587
Epoch 3/5
60000/60000 [==============================] - 14s 229us/sample - loss: 0.1064 - acc: 0.9676
Epoch 4/5
60000/60000 [==============================] - 14s 227us/sample - loss: 0.0877 - acc: 0.9729
Epoch 5/5
60000/60000 [==============================] - 14s 228us/sample - loss: 0.0750 - acc: 0.9770
10000/10000 [==============================] - 1s 149us/sample - loss: 0.0733 - acc: 0.9759
real 1m25.732s
user 1m27.420s
sys 0m11.308s
Logs with CPU execution (without GPU):
# Query device:
❯ docker run -it --rm jitteam/devicequery ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
# Sample run:
❯ docker run -it --rm -v $(pwd):/work -w /work carlosedp/l4t-tensorflow:r32.4.2-tf2-py3 bash -c "time python3 /work/hello-tf.py"
2020-06-05 19:04:43.295703: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:43.295831: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-06-05 19:04:45.829129: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer.so.7'; dlerror: /usr/lib/aarch64-linux-gnu/libnvinfer.so.7: file too short; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:45.829571: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.7: file too short; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:45.829702: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 60000 samples
2020-06-05 19:04:51.494518: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/aarch64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-06-05 19:04:51.494658: E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: UNKNOWN ERROR (303)
2020-06-05 19:04:51.494824: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (aab2062db9e6): /proc/driver/nvidia/version does not exist
2020-06-05 19:04:51.517571: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-06-05 19:04:51.519296: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x32ce5410 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-05 19:04:51.519433: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Epoch 1/5
60000/60000 [==============================] - 11s 187us/sample - loss: 0.2937 - acc: 0.9140
Epoch 2/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.1446 - acc: 0.9569
Epoch 3/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.1077 - acc: 0.9675
Epoch 4/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.0906 - acc: 0.9719
Epoch 5/5
60000/60000 [==============================] - 11s 182us/sample - loss: 0.0771 - acc: 0.9764
10000/10000 [==============================] - 1s 102us/sample - loss: 0.0793 - acc: 0.9773
real 1m6.123s
user 1m27.540s
sys 0m8.412s
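To put the two runs side by side, here is a small sketch that turns the numbers from the logs above into per-step figures. It assumes the Keras default batch size of 32, since the sample does not pass batch_size to model.fit; the helper names are made up for illustration. It is only arithmetic on the logged values, not a diagnosis.

```python
import math
import re

BATCH_SIZE = 32   # assumption: Keras default, the sample does not set batch_size
SAMPLES = 60000   # from "Train on 60000 samples" in the logs


def parse_real(line):
    """Convert a `time` line such as 'real 1m25.732s' to seconds."""
    match = re.match(r"real\s+(\d+)m([\d.]+)s", line)
    return int(match.group(1)) * 60 + float(match.group(2))


def ms_per_step(us_per_sample, batch_size=BATCH_SIZE):
    """Convert the Keras 'us/sample' figure to milliseconds per training step."""
    return us_per_sample * batch_size / 1000.0


steps_per_epoch = math.ceil(SAMPLES / BATCH_SIZE)
gpu_step_ms = ms_per_step(228)   # GPU run, epochs 2 to 5
cpu_step_ms = ms_per_step(182)   # CPU run, epochs 2 to 5
gpu_real = parse_real("real 1m25.732s")
cpu_real = parse_real("real 1m6.123s")

print(steps_per_epoch)                 # 1875
print(round(gpu_step_ms, 2))           # 7.3
print(round(cpu_step_ms, 2))           # 5.82
print(round(gpu_real / cpu_real, 2))   # 1.3
```

So under that assumption each epoch is 1875 small steps of a few milliseconds each, and the GPU run takes about 1.3 times as long as the CPU run overall.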
P.S. I saw similar execution times when running the same sample with Tensorflow 1 in the container provided by NVidia (nvcr.io/nvidia/l4t-tensorflow:r32.4.2-tf1.15-py3).