Weird performance issue

Hi,
I’ve noticed some weird performance jumps when running a network on the GPU.
I run the retinanet-9 network using trtexec on the GPU, and this is what I get:

[02/10/2021-17:44:33] [I] Host Latency
[02/10/2021-17:44:33] [I] min: 33.4995 ms (end to end 33.5034 ms)
[02/10/2021-17:44:33] [I] max: 42.0061 ms (end to end 42.0208 ms)
[02/10/2021-17:44:33] [I] mean: 34.3278 ms (end to end 34.3472 ms)
[02/10/2021-17:44:33] [I] median: 33.6477 ms (end to end 33.6599 ms)
[02/10/2021-17:44:33] [I] percentile: 41.9619 ms at 99% (end to end 41.9756 ms at 99%)
[02/10/2021-17:44:33] [I] throughput: 29.1144 qps
[02/10/2021-17:44:33] [I] walltime: 6.86946 s
[02/10/2021-17:44:33] [I] Enqueue Time
[02/10/2021-17:44:33] [I] min: 1.54883 ms
[02/10/2021-17:44:33] [I] max: 3.39941 ms
[02/10/2021-17:44:33] [I] median: 1.87952 ms
[02/10/2021-17:44:33] [I] GPU Compute
[02/10/2021-17:44:33] [I] min: 32.7964 ms
[02/10/2021-17:44:33] [I] max: 41.311 ms
[02/10/2021-17:44:33] [I] mean: 33.6255 ms
[02/10/2021-17:44:33] [I] median: 32.9451 ms
[02/10/2021-17:44:33] [I] percentile: 41.2667 ms at 99%

I don’t understand the big difference between 32 ms and 41 ms.
I ran it under nvprof, and in the attached images you can see the following:

  • Image-1: Two consecutive calls.
  • Image-2: The first one (highlighted) takes 28.907 ms.
  • Image-3: The second one (highlighted) takes 37.05 ms.
  • Image-4: A TRT function called in the first run by trtexec. Takes 1.918 ms.
  • Image-5: The same TRT function called in the second run by trtexec. Takes 2.64 ms.
  • Image-6: First call. See the kernels inside the red box.
  • Image-7: Second call. The same kernels as in Image-6, but much longer, with TRT kernels interleaved, making the whole process longer.

Any idea?

thanks
Eyal

(Images 1-7: nvprof timeline screenshots, attached; described in the list above.)

Hi,
Any idea why this might be happening?

thanks
Eyal

Hi,

Sorry for the late reply.
May I know what the tasks listed in the 4th column are? Is that also a TensorRT thread?

It seems that it occupies much more of the resources (duration) in the second run (7.77~7.78).
Since Xavier only has one GPU, TensorRT may need to wait for resources if all the GPU blocks are in use.

Thanks.

Hi @AastaLLL ,
I first saw this behavior with our own proprietary .onnx file, so I tested it on a publicly available network (retinanet-9), and it is reproducible there as well.
In the image you can see two consecutive iterations using trtexec, so there’s nothing special about it; it just runs the network twice. In one iteration the TRT kernels are shorter and packed more tightly together, while in the other iteration the same TRT operations take more time and are spread out more in the timeline.

Both iterations run the same network/logic/TRT blocks, so they should take the same time. Instead I see up to a 30-50% difference between iterations.

There’s nothing else running on the Xavier during the test.

thanks
Eyal

Hi,

Do you know what the 4th column stands for?
You can find the function / API name on the right.

My guess is that there are some reformat layers or data-transfer jobs running alongside the inference, and they occupy GPU resources and make the second run slower.

Would you mind sharing the profiling data with us so we can check it further?

Thanks.

Hi @AastaLLL ,
Attached is the nvprof output.
I used this onnx file:
cksum retinanet-9.onnx
1139499010 228369343 retinanet-9.onnx

Converted it to .plan file like this:
trtexec --fp16 --onnx=retinanet-9.onnx --saveEngine=retinanet-9.onnx.plan --workspace=22

And ran it:
trtexec --fp16 --loadEngine=retinanet-9.onnx.plan
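For longer, less noisy measurements with per-layer timings, trtexec can also be run with a few extra options (a sketch, assuming this build supports --warmUp, --duration, --useSpinWait and --dumpProfile; not what I ran above):

trtexec --fp16 --loadEngine=retinanet-9.onnx.plan --warmUp=1000 --duration=60 --useSpinWait --dumpProfile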

Usually the following is what I see (a difference of ~20% between internal runs, 32 ms to 40 ms):

[02/18/2021-12:52:16] [I] Trace averages of 10 runs:
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.7908 ms - Host latency: 33.4944 ms (end to end 33.5041 ms, enqueue 1.65187 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 33.5915 ms - Host latency: 34.2941 ms (end to end 34.3041 ms, enqueue 1.78687 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.9689 ms - Host latency: 33.6715 ms (end to end 33.6826 ms, enqueue 1.65769 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.775 ms - Host latency: 33.4791 ms (end to end 33.4891 ms, enqueue 1.59862 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 33.6338 ms - Host latency: 34.3363 ms (end to end 34.3466 ms, enqueue 1.7541 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.7827 ms - Host latency: 33.4856 ms (end to end 33.497 ms, enqueue 1.66082 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.9476 ms - Host latency: 33.6518 ms (end to end 33.664 ms, enqueue 1.6156 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 33.3682 ms - Host latency: 34.0716 ms (end to end 34.0831 ms, enqueue 1.64216 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.917 ms - Host latency: 33.6177 ms (end to end 33.6465 ms, enqueue 1.66802 ms)
[02/18/2021-12:52:16] [I] Host Latency
[02/18/2021-12:52:16] [I] min: 33.3569 ms (end to end 33.3652 ms)
[02/18/2021-12:52:16] [I] max: 41.1202 ms (end to end 41.1282 ms)
[02/18/2021-12:52:16] [I] mean: 33.7844 ms (end to end 33.7971 ms)
[02/18/2021-12:52:16] [I] median: 33.4917 ms (end to end 33.5028 ms)
[02/18/2021-12:52:16] [I] percentile: 41.1202 ms at 99% (end to end 41.1282 ms at 99%)
[02/18/2021-12:52:16] [I] throughput: 29.5883 qps
[02/18/2021-12:52:16] [I] walltime: 3.07554 s
[02/18/2021-12:52:16] [I] Enqueue Time
[02/18/2021-12:52:16] [I] min: 1.40381 ms
[02/18/2021-12:52:16] [I] max: 2.14923 ms
[02/18/2021-12:52:16] [I] median: 1.65137 ms
[02/18/2021-12:52:16] [I] GPU Compute
[02/18/2021-12:52:16] [I] min: 32.7086 ms
[02/18/2021-12:52:16] [I] max: 40.4272 ms
[02/18/2021-12:52:16] [I] mean: 33.0824 ms
[02/18/2021-12:52:16] [I] median: 32.7886 ms
[02/18/2021-12:52:16] [I] percentile: 40.4272 ms at 99%
[02/18/2021-12:52:16] [I] total compute time: 3.01049 s

I do see something else: if, during the run, I switch to another idle application using Alt+Tab, the upper bound goes up to 50+ ms… I guess the fact that the Xavier is doing both compute and display is an issue…
However, a 32 ms to 39 ms difference when the only thing running is trtexec seems high.
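A way to rule out the display contending with the GPU would be to stop the display manager for the duration of the test and restart it afterwards (assuming gdm3 is the display manager on this JetPack image):

sudo systemctl stop gdm3
trtexec --fp16 --loadEngine=retinanet-9.onnx.plan
sudo systemctl start gdm3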

retinanet9.nvvp (7.0 MB)
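The profile itself was captured roughly like this (from memory, so treat the exact invocation as a sketch):

sudo nvprof -o retinanet9.nvvp trtexec --fp16 --loadEngine=retinanet-9.onnx.plan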

I’m using CUDA 10.2 and JetPack 4.4.

thanks
Eyal

Hi,

Thanks for sharing the data with us.
We are going to check this issue and will get back to you later.


Hi,

Sorry, we are still checking this issue.
We will update you with more information later.


Hi @AastaLLL ,
Any update on this issue?

thanks
Eyal

Hi,

Sorry for the late update.

A possible reason is that the device runs with dynamic frequency scaling.
So the behavior may vary since it tries to minimize the power budget.

Have you maximized and fixed the clocks to the highest setting?
If not, could you give it a try?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
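To confirm the clocks stay locked during the run, you can also check (option names may vary slightly between JetPack releases):

$ sudo jetson_clocks --show
$ sudo tegrastats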

Thanks.

Hi @AastaLLL ,
Thanks. Yes, I’m running with the clocks maximized and fixed.

thanks
Eyal

Hi,

Sorry for the late reply.
We have checked your nvvp file but do not observe the situation you mentioned.

In the first launch, the execution takes around 34 ms.

And in the second run, it decreases to around 29 ms.

This makes sense, since the GPU requires some warm-up time.
You can see that the region marked by the red box becomes faster in the 2nd run.

Not sure if this is consistent with what you observe.
Please let us know if we have missed something.

Thanks.