I’m trying to verify classification benchmarks on the Xavier, but I’m unable to replicate the performance numbers that I see posted online. I’ve been following this source:
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification
I was largely able to follow those instructions, but I had to make the changes described in the following thread in order to get the uff converter to compile:
https://devtalk.nvidia.com/default/topic/1043619/jetson-tf_to_trt_image_classification/?offset=6
After compilation, I ran the following scripts, without modification:
source scripts/download_models.sh
python3 scripts/models_to_frozen_graphs.py
source scripts/download_images.sh
python3 scripts/frozen_graphs_to_plans.py
python3 scripts/test_trt.py
python3 scripts/test_tf.py
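For reference, I assume the two test scripts measure a simple warm-up-then-average latency loop along these lines (the run_inference callable here is hypothetical, standing in for one TensorFlow session run or one TensorRT execution):

import time

def average_latency_ms(run_inference, num_warmup=10, num_runs=50):
    # Discard warm-up iterations so one-time initialization cost
    # (CUDA context creation, cuDNN autotuning, etc.) is excluded.
    for _ in range(num_warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(num_runs):
        run_inference()
    elapsed = time.perf_counter() - start
    return elapsed / num_runs * 1000.0  # milliseconds per inference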
Below are the benchmark times for TensorFlow, in milliseconds (data/test_output_tf.txt):
vgg_16 4218.965682983398
inception_v1 15.160059928894043
inception_v2 16.9545841217041
inception_v3 31.412348747253418
inception_v4 57.95839786529541
inception_resnet_v2 70.55327415466309
resnet_v1_50 25.6368350982666
resnet_v1_101 45.75383186340332
resnet_v1_152 60.83596229553223
resnet_v2_50 33.184447288513184
resnet_v2_101 64.34244155883789
resnet_v2_152 84.0024471282959
mobilenet_v1_1p0_224 12.122135162353516
mobilenet_v1_0p5_160 6.786251068115234
mobilenet_v1_0p25_128 7.124357223510742
And for TensorRT (data/test_output_trt.txt):
data/plans/vgg_16.plan 12.1812
data/plans/inception_v1.plan 5.35698
data/plans/inception_v3.plan 22.4136
data/plans/inception_v4.plan 21.4755
data/plans/inception_resnet_v2.plan 23.2827
data/plans/resnet_v2_50.plan 8.40148
data/plans/resnet_v2_101.plan 16.299
data/plans/resnet_v2_152.plan 20.1305
data/plans/mobilenet_v1_1p0_224.plan 6.92651
data/plans/mobilenet_v1_0p5_160.plan 2.98501
data/plans/mobilenet_v1_0p25_128.plan 3.13828
The TensorRT times are somewhere between 1x and 3x faster than those reported for the TX2 in the GitHub link, which is less of an improvement than the published Xavier numbers suggest. It also seems that some of the models failed to convert: inception_v2 and the resnet_v1 variants appear in the TensorFlow results but not in the TensorRT results.
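As a side sanity check, the TF-to-TRT speedup on the Xavier itself can be computed directly from the two outputs above:

# Times (ms) copied from the two result files above.
tf_ms  = {"inception_v1": 15.16, "inception_v4": 57.96, "mobilenet_v1_1p0_224": 12.12}
trt_ms = {"inception_v1": 5.36,  "inception_v4": 21.48, "mobilenet_v1_1p0_224": 6.93}
for name in tf_ms:
    print(f"{name}: {tf_ms[name] / trt_ms[name]:.1f}x")
# inception_v1: 2.8x, inception_v4: 2.7x, mobilenet_v1_1p0_224: 1.7x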
I also tried to follow the instructions posted in the link above, which explicitly call trtexec. I wasn’t able to find the resnet50.prototxt file listed there, so I instead used the googlenet.prototxt provided at /usr/src/tensorrt/data/googlenet/googlenet.prototxt:
int8 on GPU
./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --int8 --batch=8 --iterations=10000 --output=prob --useSpinWait
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
int8
batch: 8
iterations: 10000
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 20x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 10.507 ms (host walltime is 10.5863 ms, 99% percentile time is 40.4644).
Average over 100 runs is 6.29326 ms (host walltime is 6.35656 ms, 99% percentile time is 8.84531).
Average over 100 runs is 6.23239 ms (host walltime is 6.2915 ms, 99% percentile time is 8.29283).
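Note that trtexec reports the average time per batch, so at --batch=8 the steady-state int8 latency works out to well under a millisecond per image:

batch = 8
avg_batch_ms = 6.23  # steady-state average from the int8 run above
print(f"{avg_batch_ms / batch:.2f} ms per image")  # ~0.78 ms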
fp16 on GPU
./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useSpinWait
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
fp16
batch: 8
iterations: 10000
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 20x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 9.11288 ms (host walltime is 9.17181 ms, 99% percentile time is 41.3132).
Average over 100 runs is 8.18154 ms (host walltime is 8.23468 ms, 99% percentile time is 11.0188).
Average over 100 runs is 8.12368 ms (host walltime is 8.17516 ms, 99% percentile time is 11.0971).
fp16 on DLA core 0
./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=0 --useSpinWait --allowGPUFallback
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
fp16
batch: 8
iterations: 10000
output: prob
useDLACore: 0
useSpinWait
allowGPUFallback
Input "data": 3x224x224
Output "prob": 20x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 28.7429 ms (host walltime is 28.8939 ms, 99% percentile time is 31.6979).
Average over 100 runs is 28.6642 ms (host walltime is 28.8532 ms, 99% percentile time is 30.1394).
Average over 100 runs is 28.5911 ms (host walltime is 28.7823 ms, 99% percentile time is 29.482).
fp16 on DLA core 1
./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=1 --useSpinWait --allowGPUFallback
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
fp16
batch: 8
iterations: 10000
output: prob
useDLACore: 1
useSpinWait
allowGPUFallback
Input "data": 3x224x224
Output "prob": 20x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 28.6083 ms (host walltime is 28.7687 ms, 99% percentile time is 33.8545).
Average over 100 runs is 28.4075 ms (host walltime is 28.6114 ms, 99% percentile time is 29.0141).
Average over 100 runs is 28.5257 ms (host walltime is 28.6889 ms, 99% percentile time is 30.4743).
The DLA times are slower than the GPU times, even comparing FP16 on the DLA (~28.6 ms) against FP16 on the GPU (~8.1 ms). According to the benchmarks provided above, I should be observing approximately 4 ms inference time with batch size 8 on the DLA cores. Based on the output text, it looks like only the prob layer is falling back to the GPU, and a single fallback layer should hopefully not incur a 3x performance penalty.
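For what it’s worth, my understanding is that the same DLA placement trtexec requests can be asked for through the TensorRT Python API roughly as follows. This is a sketch using the IBuilderConfig-style API from newer TensorRT releases, which may not match the TensorRT version shipped with JetPack 4.1.1:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network()  # populated via a Caffe/UFF/ONNX parser in practice
config = builder.create_builder_config()

builder.max_batch_size = 8                     # equivalent of --batch=8
config.set_flag(trt.BuilderFlag.FP16)          # DLA requires FP16 or INT8 precision
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # equivalent of --allowGPUFallback
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                            # equivalent of --useDLACore=0

# Layers that DLA cannot run (e.g. the prob layer above) fall back to the GPU.
engine = builder.build_engine(network, config)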
All benchmarks were taken on an AGX Xavier Devkit with JetPack 4.1.1 installed, running in MAXN mode.
Is there another set of instructions or links that I should be following in order to make use of the DLAs? I want to ensure that I can replicate the provided benchmarks before trying to use the DLAs to perform object detection. I am following the above links due to the suggestions here: