performance of AGX

I’m just getting up and running on the AGX platform and want to make sure things are set up correctly and that the performance I’m getting from my system is what one would expect. Can someone please run the following MNIST test code and compare your results with mine? I’m using mode 0 (sudo nvpmodel -m 0) and setting max clocks (sudo jetson_clocks.sh).

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Result on AGX:

$ python3 digits.py 
Epoch 1/5
2018-12-26 04:23:04.691017: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2018-12-26 04:23:04.691301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.46GiB freeMemory: 10.53GiB
2018-12-26 04:23:04.691758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-26 04:23:05.355237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-26 04:23:05.355360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-26 04:23:05.355441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-26 04:23:05.355654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10010 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
60000/60000 [==============================] - 14s 236us/step - loss: 0.2027 - acc: 0.9399
Epoch 2/5
60000/60000 [==============================] - 11s 177us/step - loss: 0.0814 - acc: 0.9749
Epoch 3/5
60000/60000 [==============================] - 11s 176us/step - loss: 0.0514 - acc: 0.9836
Epoch 4/5
60000/60000 [==============================] - 11s 175us/step - loss: 0.0368 - acc: 0.9881
Epoch 5/5
60000/60000 [==============================] - 10s 174us/step - loss: 0.0272 - acc: 0.9913
10000/10000 [==============================] - 1s 93us/step

I ran the same code in a TensorFlow Docker container on my host, a 2018 15" MacBook Pro laptop, and I am getting better results than this. The Docker engine is configured to use only 4 cores (2.2GHz i7) and 8GB of memory.
Result on host (4 cores i7 @2.2GHz and 4GB memory):

Epoch 1/5
60000/60000 [==============================] - 9s 157us/step - loss: 0.2018 - acc: 0.9404
Epoch 2/5
60000/60000 [==============================] - 9s 151us/step - loss: 0.0797 - acc: 0.9754
Epoch 3/5
60000/60000 [==============================] - 9s 151us/step - loss: 0.0535 - acc: 0.9832
Epoch 4/5
60000/60000 [==============================] - 9s 149us/step - loss: 0.0383 - acc: 0.9878
Epoch 5/5
60000/60000 [==============================] - 9s 151us/step - loss: 0.0275 - acc: 0.9911
10000/10000 [==============================] - 0s 40us/step

Someone can step in, but based on some research I thought the Jetson AGX is not intended to be used for ‘fast’ training? A 2018 MacBook Pro should outperform a Jetson AGX on training, while the Jetson AGX should excel at executing inference on already trained/built models for applications with low power requirements.

Well, this example includes both training and inference. If I am reading this correctly, the GPU is doing much worse on inference (93us/step vs 40us/step). And I am not surprised. I would expect the GPU to do ‘relatively’ better on training, where batch parallelism is present. Don’t you think?

But in any case, my foremost purpose is to compare numbers with someone who is confident of their AGX software setup, just as a sanity check.
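
As a rough way to put the numbers above side by side, here is a minimal sketch in plain Python. The timings are copied from the two logs in this thread; the variable names are made up for illustration, and it assumes the figures Keras labels us/step are per-sample times (11 s per epoch over 60000 samples is about 180 us, which fits). Epoch 1 is left out on both machines because on the AGX it overlaps with GPU device start-up, as the log timestamps show.

```python
# Per-sample timings in microseconds, copied from the Keras progress lines above.
agx_gpu_train_us = [236, 177, 176, 175, 174]   # AGX Xavier, GPU, epochs 1-5
host_cpu_train_us = [157, 151, 151, 149, 151]  # MacBook Pro, Docker, epochs 1-5
agx_gpu_eval_us = 93                           # model.evaluate on the AGX
host_cpu_eval_us = 40                          # model.evaluate on the host


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Drop epoch 1 on both machines so one-time start-up cost does not skew the average.
agx_train = mean(agx_gpu_train_us[1:])
host_train = mean(host_cpu_train_us[1:])

# Ratios above 1.0 mean the AGX needs more time per sample than the host.
train_ratio = agx_train / host_train
eval_ratio = agx_gpu_eval_us / host_cpu_eval_us

print(f"training:   AGX takes {train_ratio:.2f}x the host time per sample")
print(f"evaluation: AGX takes {eval_ratio:.2f}x the host time per sample")
```

On these numbers the gap is much smaller for training than for evaluation, which is in line with the expectation stated in the post above.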

Hi,

We have a dedicated performance report for Jetson Xavier here:
https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks

Thanks.

I tried running with CPU.

$ env CUDA_VISIBLE_DEVICES="" python3 digits.py
Epoch 1/5
2018-12-26 09:18:48.304785: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2018-12-26 09:18:48.304885: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel driver does not appear to be running on this host (jetson-0423218011297): /proc/driver/nvidia/version does not exist
60000/60000 [==============================] - 13s 213us/step - loss: 0.2012 - acc: 0.9410
Epoch 2/5
60000/60000 [==============================] - 12s 202us/step - loss: 0.0820 - acc: 0.9747
Epoch 3/5
60000/60000 [==============================] - 12s 204us/step - loss: 0.0521 - acc: 0.9839
Epoch 4/5
60000/60000 [==============================] - 12s 200us/step - loss: 0.0359 - acc: 0.9885
Epoch 5/5
60000/60000 [==============================] - 12s 200us/step - loss: 0.0268 - acc: 0.9912
10000/10000 [==============================] - 1s 85us/step

The time of model.evaluate() does not seem to change whether it runs on the GPU or not.
(Of course, it does change with a big model such as Xception.)
This model seems a little too small to show GPU performance.
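
To give a sense of how small the model is, here is a minimal sketch in plain Python that counts the trainable parameters of the posted network and the number of batches per epoch. The helper name dense_params is made up for illustration. The batch size of 32 is the Keras default for model.fit when batch_size is not passed, which is the case in the posted script.

```python
import math

# Layer sizes from the posted model: Flatten(28x28) -> Dense(512) -> Dense(10).
input_features = 28 * 28
hidden_units = 512
num_classes = 10


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Flatten and Dropout have no trainable parameters, so only the Dense layers count.
total_params = (dense_params(input_features, hidden_units)
                + dense_params(hidden_units, num_classes))

train_samples = 60000
default_batch_size = 32  # Keras default; the posted script does not override it
steps_per_epoch = math.ceil(train_samples / default_batch_size)

print("trainable parameters:", total_params)
print("batches per epoch:", steps_per_epoch)
```

With so few parameters and so many small batches, fixed per-batch overhead could plausibly outweigh the actual arithmetic on the GPU. Passing a larger batch_size to model.fit would be one way to check this, though that is an untested suggestion here.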

Thanks for checking @naisy.
I guess I’ll try using the benchmark examples to get a better idea.

Benchmarks are benchmarks. Shouldn’t user-written code like yours, @hucqym, be a better indicator of real-world performance than any benchmark?

Based on what @naisy posted, are the results consistent with the idea that you should do training on your 2018 MacBook Pro and only perform inference on the AGX Xavier (because the 2018 MacBook Pro outperforms the AGX Xavier as far as training goes)?

@hcyqym, let us know what you observe.

Here is a link to an article that recommends against training anything on the Jetson and suggests using it only for inference.

https://devtalk.nvidia.com/default/topic/964604/caffe-imagenet-train-error-on-tx1/?offset=4

Someone please say something if this no longer applies because of how much better the AGX Xavier is, and if training on it, in addition to inference, is actually recommended.