performance of AGX

hucqym · December 26, 2018, 4:50am

I’m just getting up and running on the AGX platform and want to make sure things are set up correctly and that the performance I’m getting from my system is what one would expect. Can someone please run the following MNIST test code and compare your results with mine? I’m using mode 0 (sudo nvpmodel -m 0) and setting max clocks (sudo jetson_clocks.sh).

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Result on AGX:

$ python3 digits.py 
Epoch 1/5
2018-12-26 04:23:04.691017: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2018-12-26 04:23:04.691301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.46GiB freeMemory: 10.53GiB
2018-12-26 04:23:04.691758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-26 04:23:05.355237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-26 04:23:05.355360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-26 04:23:05.355441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-26 04:23:05.355654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10010 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
60000/60000 [==============================] - 14s 236us/step - loss: 0.2027 - acc: 0.9399
Epoch 2/5
60000/60000 [==============================] - 11s 177us/step - loss: 0.0814 - acc: 0.9749
Epoch 3/5
60000/60000 [==============================] - 11s 176us/step - loss: 0.0514 - acc: 0.9836
Epoch 4/5
60000/60000 [==============================] - 11s 175us/step - loss: 0.0368 - acc: 0.9881
Epoch 5/5
60000/60000 [==============================] - 10s 174us/step - loss: 0.0272 - acc: 0.9913
10000/10000 [==============================] - 1s 93us/step

I ran the same code in a TensorFlow Docker container on my host, a 2018 15" MacBook Pro laptop, and I am getting better results than this. The Docker engine is configured to use only 4 cores (2.2GHz i7) and 8GB of memory.
Result on host (4 cores i7 @2.2GHz and 4GB memory):

Epoch 1/5
60000/60000 [==============================] - 9s 157us/step - loss: 0.2018 - acc: 0.9404
Epoch 2/5
60000/60000 [==============================] - 9s 151us/step - loss: 0.0797 - acc: 0.9754
Epoch 3/5
60000/60000 [==============================] - 9s 151us/step - loss: 0.0535 - acc: 0.9832
Epoch 4/5
60000/60000 [==============================] - 9s 149us/step - loss: 0.0383 - acc: 0.9878
Epoch 5/5
60000/60000 [==============================] - 9s 151us/step - loss: 0.0275 - acc: 0.9911
10000/10000 [==============================] - 0s 40us/step

asrocki · December 26, 2018, 5:39am

Someone can step in, but based on some research, I thought the Jetson AGX is not intended to be used for ‘fast’ training? A 2018 MacBook Pro should outperform a Jetson AGX on training, while the AGX should excel at executing inference on already trained/built models for applications with low power requirements.

hucqym · December 26, 2018, 6:24am

Well, this example includes both training and inference. If I am reading this correctly, the GPU is doing much worse on inference (93us/step vs 40us/step). And I am not surprised. I would expect the GPU to do ‘relatively’ better on training, where batch parallelism is present. Don’t you think?

But in any case, my main purpose is to compare numbers with someone who is confident in their AGX software setup, just as a sanity check.
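For context on the batch-parallelism point: the script calls model.fit without a batch_size, and Keras defaults to 32 samples per batch in that case. The sketch below is only back-of-the-envelope arithmetic. It assumes the "us/step" figure in the progress bar is per sample (11s for 60000 samples is roughly 180us, which matches the log) and uses an approximate 175us figure from the AGX run. The helper names are hypothetical, not part of any library.

```python
import math

TRAIN_SAMPLES = 60000      # from the "60000/60000" progress counter in the log
US_PER_SAMPLE_AGX = 175    # approximate steady-state value from the AGX log (assumption)
KERAS_DEFAULT_BATCH = 32   # Keras default when fit() is called without batch_size


def batches_per_epoch(n_samples, batch_size):
    """Number of batches (weight updates) needed to cover one epoch."""
    return math.ceil(n_samples / batch_size)


def ms_per_batch(us_per_sample, batch_size):
    """Wall-clock time spent on one batch, in milliseconds."""
    return us_per_sample * batch_size / 1000.0


# Compare the default batch size with two larger, arbitrarily chosen ones.
for bs in (KERAS_DEFAULT_BATCH, 256, 1024):
    print(f"batch_size={bs}: {batches_per_epoch(TRAIN_SAMPLES, bs)} batches per epoch")

print(f"~{ms_per_batch(US_PER_SAMPLE_AGX, KERAS_DEFAULT_BATCH):.1f} ms per default-size batch on the AGX")
```

With the default, each epoch is 1875 small batches at roughly 5.6 ms each, so fixed per-batch overhead may well dominate for a model this small. Passing a larger batch_size to model.fit is one way to test that, keeping in mind that it also changes how the model converges.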

AastaLLL · December 26, 2018, 7:55am

Hi,

We have a dedicated performance report for Jetson Xavier here:
https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks

Thanks.

naisy · December 26, 2018, 10:11am

I tried running with CPU.

$ env CUDA_VISIBLE_DEVICES="" python3 digits.py
Epoch 1/5
2018-12-26 09:18:48.304785: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2018-12-26 09:18:48.304885: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel driver does not appear to be running on this host (jetson-0423218011297): /proc/driver/nvidia/version does not exist
60000/60000 [==============================] - 13s 213us/step - loss: 0.2012 - acc: 0.9410
Epoch 2/5
60000/60000 [==============================] - 12s 202us/step - loss: 0.0820 - acc: 0.9747
Epoch 3/5
60000/60000 [==============================] - 12s 204us/step - loss: 0.0521 - acc: 0.9839
Epoch 4/5
60000/60000 [==============================] - 12s 200us/step - loss: 0.0359 - acc: 0.9885
Epoch 5/5
60000/60000 [==============================] - 12s 200us/step - loss: 0.0268 - acc: 0.9912
10000/10000 [==============================] - 1s 85us/step

The model.evaluate() time does not seem to change whether it runs on the GPU or not.
(Of course, it does change for a big model such as Xception.)
This model seems a little too small to show GPU performance.
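To put the three runs posted so far side by side, here is a small sketch that averages the per-epoch training figures copied from the logs in this thread. Skipping epoch 1 is an assumption on my part: the first epoch in the GPU log overlaps with device setup, so it is probably not representative. The function name and labels are hypothetical.

```python
from statistics import mean

# "us/step" training values copied from the logs in this thread (epochs 1-5).
TRAIN_US = {
    "AGX GPU": [236, 177, 176, 175, 174],
    "AGX CPU": [213, 202, 204, 200, 200],
    "MacBook (Docker)": [157, 151, 151, 149, 151],
}


def steady_state_us(per_epoch_us):
    """Mean time per sample, skipping epoch 1 (likely includes one-time startup cost)."""
    return mean(per_epoch_us[1:])


summary = {name: steady_state_us(values) for name, values in TRAIN_US.items()}
baseline = summary["AGX GPU"]

# Print each run's average and its ratio to the AGX GPU run.
for name, us in summary.items():
    print(f"{name}: {us:.1f} us/sample, {us / baseline:.2f}x the AGX GPU time")
```

On these numbers the three runs are within roughly 15% of each other, which fits the idea that this model is too small for the GPU to make much difference.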

hucqym · December 26, 2018, 11:08am

Thanks for checking @naisy.
I guess I’ll try the benchmark examples to get a better idea.

asrocki · December 26, 2018, 3:31pm

Benchmarks are benchmarks. Shouldn’t user-written code like yours, @hucqym, be a better indicator of real-world performance than any benchmark?

Based on what @naisy posted, are the results consistent with the idea that you should train on your 2018 MacBook Pro and only run inference on the AGX Xavier (because the 2018 MacBook Pro outperforms the AGX Xavier for training)?

asrocki · December 26, 2018, 10:13pm

@hucqym, let us know what you observe.

Here is a link to a thread that recommends against training on the Jetson and suggests using it only for inference.

https://devtalk.nvidia.com/default/topic/964604/caffe-imagenet-train-error-on-tx1/?offset=4

Someone please say something if this no longer applies given how much more capable the AGX Xavier is, and if training on it, in addition to inference, is actually recommended.
