Jetson Xavier benchmarks mismatch

Hi,

I am using the Jetson AGX Xavier with the latest JetPack 4.1.1 (TensorRT 5.0)
I was trying to reproduce the benchmark results posted on this page:
https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks
and found a gap between the published results and mine.

Can you guide me on how to get the same results?


My only interest is the ResNet-50 graph with batch size 8.

The published results show:
LATENCY (ms) = 11.2 for 15W Mode
LATENCY (ms) = 6.2 for MAX-N Mode

I assume they used this command:
./trtexec --avgRuns=100 --deploy=resnet50.prototxt --int8 --batch=8 --iterations=10000 --output=prob --useSpinWait

Which runs on the GPU only with INT8 precision.
(Using the DLA with FP16 is about 3x slower than GPU-only with FP16.)

Please see my ./trtexec output using the same command (except with --iterations=10):

(15W mode)

avgRuns: 1000
deploy: /home/nvidia/Networks/ResNet-50/deploy.prototxt
int8
batch: 8
iterations: 10
output: prob
useSpinWait
Input “data”: 3x224x224
Output “prob”: 1000x1x1

name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 1000 runs is 14.3147 ms (host walltime is 14.3454 ms, 99% percentile time is 14.3826).
Average over 1000 runs is 14.2869 ms (host walltime is 14.3124 ms, 99% percentile time is 14.3984).
Average over 1000 runs is 14.2821 ms (host walltime is 14.308 ms, 99% percentile time is 14.3534).

(MAX-N Mode)

avgRuns: 100
deploy: /home/nvidia/Networks/ResNet-50/deploy.prototxt
int8
batch: 8
iterations: 10
output: prob
useSpinWait
Input “data”: 3x224x224
Output “prob”: 1000x1x1

name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 9.6837 ms (host walltime is 9.69914 ms, 99% percentile time is 33.8719).
Average over 100 runs is 7.48239 ms (host walltime is 7.49908 ms, 99% percentile time is 8.92989).
Average over 100 runs is 7.49587 ms (host walltime is 7.50919 ms, 99% percentile time is 8.79376).
Average over 100 runs is 7.47715 ms (host walltime is 7.49505 ms, 99% percentile time is 8.53834).

Any idea why there is a gap between the published benchmarks and mine?

Moving to Jetson AGX Xavier devtalk for support coverage.

Hi tavorbental, as mentioned on the page, the benchmark results report the cumulative performance from the concurrent use of GPU (INT8) and two DLAs (FP16). You can launch three instances of trtexec simultaneously, with one instance running per device, as seen in the example commands here:

https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks#trtexec
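For reference, here is a minimal Python sketch of launching the three instances at once, one per device, using the flags that appear in this thread. It assumes trtexec is in the current directory and reuses the deploy.prototxt path from your logs; adjust both to your setup.

import subprocess

# Assumed paths -- adjust to your setup.
TRTEXEC = "./trtexec"
DEPLOY = "/home/nvidia/Networks/ResNet-50/deploy.prototxt"

common = ["--deploy=" + DEPLOY, "--output=prob", "--batch=8",
          "--avgRuns=100", "--iterations=10", "--useSpinWait"]

cmds = [
    [TRTEXEC, "--int8"] + common,                                          # GPU, INT8
    [TRTEXEC, "--fp16", "--useDLACore=0", "--allowGPUFallback"] + common,  # DLA 0, FP16
    [TRTEXEC, "--fp16", "--useDLACore=1", "--allowGPUFallback"] + common,  # DLA 1, FP16
]

# Launch all three at once so the GPU and both DLAs run concurrently,
# then wait for every instance to finish.
procs = [subprocess.Popen(cmd) for cmd in cmds]
for p in procs:
    p.wait()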

Hi dusty,

Thank you for the answer,
but can you please be more specific about how you combine the performance figures?

When I run the trtexec simultaneously (MAXN mode):

int8
batch: 6
iterations: 10
output: prob
useSpinWait
Input “data”: 3x224x224
Output “prob”: 1000x1x1

Average over 1000 runs is 6.90604 ms (host walltime is 6.99801 ms, 99% percentile time is 8.89805).

fp16
batch: 1
iterations: 10
output: prob
useSpinWait
useDLACore: 0
allowGPUFallback
Input “data”: 3x224x224
Output “prob”: 1000x1x1

Average over 1000 runs is 7.66789 ms (host walltime is 8.55767 ms, 99% percentile time is 8.84243).

fp16
batch: 1
iterations: 10
output: prob
useSpinWait
useDLACore: 1
allowGPUFallback
Input “data”: 3x224x224
Output “prob”: 1000x1x1

Average over 1000 runs is 7.65092 ms (host walltime is 8.47453 ms, 99% percentile time is 8.77722).

So I had batch=6 on the GPU and batch=1 on each of the two DLAs.
Because they all run simultaneously, I think the slowest one is what counts.
So it seems reasonable to me that ResNet-50 with batch=8 ends up at 7.65092 ms.

Only when I run the GPU alone with INT8 and batch=6 am I able to get the same performance:

int8
batch: 6
iterations: 10
output: prob
useSpinWait
Input “data”: 3x224x224
Output “prob”: 1000x1x1

Average over 1000 runs is 6.23396 ms (host walltime is 6.27553 ms, 99% percentile time is 8.75008).

What did I miss?

Thanks,
Bental

Hi Bental, run all 3 devices with batch size 8, as if there were an asynchronous queue of incoming images that need to be processed. The idea is to measure the sustained throughput.

Calculate the images per second from each trtexec instance by taking (1000 / latency) × batchSize. Then add these 3 figures together to get the cumulative images per second processed by the system.

To get the average latency of the system, take (1000 / cumulative images per second) × batchSize.
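A quick worked example of that arithmetic in Python, using the latencies you posted above. Note those runs used batch=6 on the GPU and batch=1 on each DLA; with all three at batch=8 the formula is the same, only the numbers change.

# (latency_ms, batch) for each instance, taken from the logs above
runs = [
    (6.90604, 6),  # GPU, INT8, batch=6
    (7.66789, 1),  # DLA 0, FP16, batch=1
    (7.65092, 1),  # DLA 1, FP16, batch=1
]

# Images per second for each instance: (1000 / latency_ms) * batch
ips = [1000.0 / latency * batch for latency, batch in runs]

total_ips = sum(ips)                           # ~1130 images/sec cumulative
batch_size = 8
avg_latency = 1000.0 / total_ips * batch_size  # effective ms per batch of 8

print("cumulative: %.1f img/s, average latency: %.2f ms"
      % (total_ips, avg_latency))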

Hi, we've been running a similar benchmark, and to get reasonable results we had to measure over a larger number of images. Have a look at our report for a computer vision scenario: http://www.bytelake.com/en/nvidia-movidius-comparison
Anyway, let's get in touch; we specialize in NVIDIA architectures.

Hello,

I am regenerating the numbers for ResNet-50.
My question is:

How do I pick one latency value from the whole trtexec output?
I mean, which value should I consider?

- The average of all latencies? ((115.195 + 116.495 + … + 80.447) / num of iterations)
- The last latency value among the 10000 iterations? (80.4473 ms)
- Or the lowest latency value among the 10000 iterations? (80.4473 ms)

Average over 100 runs is 115.195 ms (host walltime is 117.162 ms, 99% percentile time is 123.003).
Average over 100 runs is 116.495 ms (host walltime is 118.372 ms, 99% percentile time is 121.337).
Average over 100 runs is 116.474 ms (host walltime is 118.403 ms, 99% percentile time is 120.689).
Average over 100 runs is 116.56 ms (host walltime is 118.522 ms, 99% percentile time is 121.085).
Average over 100 runs is 116.542 ms (host walltime is 118.433 ms, 99% percentile time is 121.146).
Average over 100 runs is 116.459 ms (host walltime is 118.423 ms, 99% percentile time is 120.822).
Average over 100 runs is 117.279 ms (host walltime is 119.162 ms, 99% percentile time is 133.677).
Average over 100 runs is 94.3713 ms (host walltime is 95.3969 ms, 99% percentile time is 122.374).
Average over 100 runs is 89.473 ms (host walltime is 90.1063 ms, 99% percentile time is 107.742).
Average over 100 runs is 87.3919 ms (host walltime is 88.4426 ms, 99% percentile time is 106.499).
Average over 100 runs is 88.3755 ms (host walltime is 89.69 ms, 99% percentile time is 100.572).
Average over 100 runs is 87.3318 ms (host walltime is 88.034 ms, 99% percentile time is 102.486).
Average over 100 runs is 86.0923 ms (host walltime is 87.2495 ms, 99% percentile time is 92.6218).
Average over 100 runs is 84.7149 ms (host walltime is 85.9822 ms, 99% percentile time is 97.3916).
Average over 100 runs is 81.712 ms (host walltime is 82.9645 ms, 99% percentile time is 83.3956).
Average over 100 runs is 85.1401 ms (host walltime is 86.3638 ms, 99% percentile time is 95.7614).
Average over 100 runs is 87.6545 ms (host walltime is 88.9401 ms, 99% percentile time is 95.2939).
Average over 100 runs is 88.959 ms (host walltime is 90.2963 ms, 99% percentile time is 100.425).
Average over 100 runs is 85.7611 ms (host walltime is 87.0619 ms, 99% percentile time is 96.8633).
Average over 100 runs is 80.4473 ms (host walltime is 81.6053 ms, 99% percentile time is 91.5625).

One more surprising observation: for ResNet-50 and GoogLeNet, the DLAs show almost the same latency whether running concurrently (GPU+DLA0+DLA1) or alone (DLA0 only).
For MobileNet and MobileNet-SSD, however, latency increases during concurrent runs, which seems logical. Has anybody observed something similar? Any reason behind it?

Many thanks in advance…

Hi,

It's known that the GPU has a longer launch time at the beginning,
so it's recommended to use the average value while excluding the initial few runs.
For example, averaging from the 6th result to the last one would be good.
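In case it helps, a small Python sketch that pulls the per-interval averages out of trtexec's stdout and discards the first few warm-up intervals; the regex matches the "Average over N runs is X ms" lines shown in this thread.

import re
import sys

# Matches lines like: "Average over 100 runs is 115.195 ms (host walltime ...)"
pattern = re.compile(r"Average over \d+ runs is ([\d.]+) ms")
latencies = [float(m.group(1)) for m in pattern.finditer(sys.stdin.read())]

warmup = 5                    # discard the first 5 intervals, start at the 6th
steady = latencies[warmup:]
print("steady-state average: %.3f ms over %d intervals"
      % (sum(steady) / len(steady), len(steady)))

Pipe the trtexec output into it, e.g. ./trtexec ... | python3 parse_latency.py (the script name is just for illustration).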

For the second question, some operations in MobileNet are not supported by the DLA,
so their implementation falls back to the GPU, which introduces some latency.

Thanks.