Weird performance issue

Hi,
I’ve noticed some weird performance jumps when running a network on the GPU.
I run the retinanet-9 network using trtexec on the GPU, and this is what I get:

[02/10/2021-17:44:33] [I] Host Latency
[02/10/2021-17:44:33] [I] min: 33.4995 ms (end to end 33.5034 ms)
[02/10/2021-17:44:33] [I] max: 42.0061 ms (end to end 42.0208 ms)
[02/10/2021-17:44:33] [I] mean: 34.3278 ms (end to end 34.3472 ms)
[02/10/2021-17:44:33] [I] median: 33.6477 ms (end to end 33.6599 ms)
[02/10/2021-17:44:33] [I] percentile: 41.9619 ms at 99% (end to end 41.9756 ms at 99%)
[02/10/2021-17:44:33] [I] throughput: 29.1144 qps
[02/10/2021-17:44:33] [I] walltime: 6.86946 s
[02/10/2021-17:44:33] [I] Enqueue Time
[02/10/2021-17:44:33] [I] min: 1.54883 ms
[02/10/2021-17:44:33] [I] max: 3.39941 ms
[02/10/2021-17:44:33] [I] median: 1.87952 ms
[02/10/2021-17:44:33] [I] GPU Compute
[02/10/2021-17:44:33] [I] min: 32.7964 ms
[02/10/2021-17:44:33] [I] max: 41.311 ms
[02/10/2021-17:44:33] [I] mean: 33.6255 ms
[02/10/2021-17:44:33] [I] median: 32.9451 ms
[02/10/2021-17:44:33] [I] percentile: 41.2667 ms at 99%

I don’t understand the big difference between 32 ms and 41 ms.
I ran it under nvprof, and in the attached images you can see the following:

  • Image-1: Two consecutive calls.
  • Image-2: The first one (highlighted) takes 28.907 ms.
  • Image-3: The second one (highlighted) takes 37.05 ms.
  • Image-4: A TRT function called in the first run by trtexec. Takes 1.918 ms.
  • Image-5: The same TRT function called in the second run by trtexec. Takes 2.64 ms.
  • Image-6: First call. See the kernels inside the red box.
  • Image-7: Second call. The same kernels as in Image-6, but much longer, with TRT kernels interleaved, making the whole process longer.

Any idea?

thanks
Eyal

(Images 1-7: nvprof timeline screenshots, attached; described in the list above.)

Hi,
Any idea why this might be happening?

thanks
Eyal

Hi,

Sorry for the late reply.
May I know what the tasks listed in the 4th column are? Is that also a TensorRT thread?

It seems that it occupies much more of the resources (duration) in the second run (7.77~7.78).
Since Xavier only has one GPU, TensorRT may need to wait for resources if all the GPU blocks are in use.

Thanks.

Hi @AastaLLL ,
I first saw this behavior with our own proprietary .onnx file, so I tested it on a publicly available network (retinanet-9), and it is reproducible there as well.
In the image you can see two consecutive iterations using trtexec, so there’s nothing special about it; it just runs the network twice. In one iteration the TRT kernels are shorter and packed more tightly together, while in the other iteration the same TRT operations take more time and are spread out more in the timeline.

Both iterations run the same network/logic/TRT blocks, so they should take the same time. Instead I see up to a 30-50% difference between iterations.

There’s nothing else running on the Xavier during the test.

thanks
Eyal

Hi,

Do you know what the 4th column stands for?
You can find the function / API name on the right.

My guess is that there are some reformat layers or data-transfer jobs running alongside the inference, and they occupy GPU resources and make the second run slower.

Would you mind sharing the profiling data with us so we can check it further?

Thanks.

Hi @AastaLLL ,
Attached is the nvprof output.
I used this onnx file:
cksum retinanet-9.onnx
1139499010 228369343 retinanet-9.onnx

Converted it to .plan file like this:
trtexec --fp16 --onnx=retinanet-9.onnx --saveEngine=retinanet-9.onnx.plan --workspace=22

And ran it:
trtexec --fp16 --loadEngine=retinanet-9.onnx.plan
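For longer, less noisy measurements with per-layer timings, trtexec can also be run with a few extra options (a sketch, assuming this build supports --warmUp, --duration, --useSpinWait and --dumpProfile; not what I ran above):

trtexec --fp16 --loadEngine=retinanet-9.onnx.plan --warmUp=1000 --duration=60 --useSpinWait --dumpProfile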

Usually the following is what I see (a difference of ~20% between internal runs, 32 ms to 40 ms):

[02/18/2021-12:52:16] [I] Trace averages of 10 runs:
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.7908 ms - Host latency: 33.4944 ms (end to end 33.5041 ms, enqueue 1.65187 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 33.5915 ms - Host latency: 34.2941 ms (end to end 34.3041 ms, enqueue 1.78687 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.9689 ms - Host latency: 33.6715 ms (end to end 33.6826 ms, enqueue 1.65769 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.775 ms - Host latency: 33.4791 ms (end to end 33.4891 ms, enqueue 1.59862 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 33.6338 ms - Host latency: 34.3363 ms (end to end 34.3466 ms, enqueue 1.7541 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.7827 ms - Host latency: 33.4856 ms (end to end 33.497 ms, enqueue 1.66082 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.9476 ms - Host latency: 33.6518 ms (end to end 33.664 ms, enqueue 1.6156 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 33.3682 ms - Host latency: 34.0716 ms (end to end 34.0831 ms, enqueue 1.64216 ms)
[02/18/2021-12:52:16] [I] Average on 10 runs - GPU latency: 32.917 ms - Host latency: 33.6177 ms (end to end 33.6465 ms, enqueue 1.66802 ms)
[02/18/2021-12:52:16] [I] Host Latency
[02/18/2021-12:52:16] [I] min: 33.3569 ms (end to end 33.3652 ms)
[02/18/2021-12:52:16] [I] max: 41.1202 ms (end to end 41.1282 ms)
[02/18/2021-12:52:16] [I] mean: 33.7844 ms (end to end 33.7971 ms)
[02/18/2021-12:52:16] [I] median: 33.4917 ms (end to end 33.5028 ms)
[02/18/2021-12:52:16] [I] percentile: 41.1202 ms at 99% (end to end 41.1282 ms at 99%)
[02/18/2021-12:52:16] [I] throughput: 29.5883 qps
[02/18/2021-12:52:16] [I] walltime: 3.07554 s
[02/18/2021-12:52:16] [I] Enqueue Time
[02/18/2021-12:52:16] [I] min: 1.40381 ms
[02/18/2021-12:52:16] [I] max: 2.14923 ms
[02/18/2021-12:52:16] [I] median: 1.65137 ms
[02/18/2021-12:52:16] [I] GPU Compute
[02/18/2021-12:52:16] [I] min: 32.7086 ms
[02/18/2021-12:52:16] [I] max: 40.4272 ms
[02/18/2021-12:52:16] [I] mean: 33.0824 ms
[02/18/2021-12:52:16] [I] median: 32.7886 ms
[02/18/2021-12:52:16] [I] percentile: 40.4272 ms at 99%
[02/18/2021-12:52:16] [I] total compute time: 3.01049 s

I do see something else: if, during the run, I switch to another idle application using Alt+Tab, the upper bound goes up to 50+ ms… I guess the fact that the Xavier is doing both compute and display is an issue…
However, a 32 ms to 39 ms difference when the only thing running is trtexec seems high.
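A way to rule out the display contending with the GPU would be to stop the display manager for the duration of the test and restart it afterwards (assuming gdm3 is the display manager on this JetPack image):

sudo systemctl stop gdm3
trtexec --fp16 --loadEngine=retinanet-9.onnx.plan
sudo systemctl start gdm3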

retinanet9.nvvp (7.0 MB)
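The profile itself was captured roughly like this (from memory, so treat the exact invocation as a sketch):

sudo nvprof -o retinanet9.nvvp trtexec --fp16 --loadEngine=retinanet-9.onnx.plan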

I’m using CUDA 10.2 and JetPack 4.4.

thanks
Eyal

Hi,

Thanks for sharing the data with us.
We are going to check this issue and will get back to you later.


Hi,

Sorry, we are still checking this issue.
We will update you with more information later.


Hi @AastaLLL ,
Any update on this issue?

thanks
Eyal

Hi,

Sorry for the late update.

A possible reason is that the device runs with dynamic frequency scaling.
So the behavior may vary since it tries to minimize the power budget.

Have you maximized and fixed the clocks to the highest setting?
If not, could you give it a try?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
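To confirm the clocks stay locked during the run, you can also check (option names may vary slightly between JetPack releases):

$ sudo jetson_clocks --show
$ sudo tegrastats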

Thanks.

Hi @AastaLLL ,
Thanks. Yes, I’m running with the clocks maximized and fixed.

thanks
Eyal

Hi,

Sorry for the late reply.
We have checked your nvvp file but do not observe the situation you mentioned.

In the first launch, the execution takes around 34 ms.

And in the second run, it decreases to around 29 ms.

This makes sense, since the GPU requires some warm-up time.
You can see that the region marked by the red box becomes faster in the 2nd run.

Not sure if this is consistent with what you observe.
Please let us know if we have missed something.

Thanks.