Jetson AGX TensorRT inference latency very high after upgrading the kernel from 32.7.2 to 32.7.4

My device is a Jetson AGX with JetPack 4.6 and TensorRT 8.2.
My question is: why does TensorRT run with very high latency after I upgraded my kernel from 32.7.2 to 32.7.4?

Before I upgraded the kernel (32.7.2), I ran the TensorRT inference C++ demo app and got the profiler output below.

[2024-11-25 03:50:59.619] [info] Trt: detection inference BatchCount: 15455
[2024-11-25 03:50:59.619] [info] |_ Preprocess: 0 ms/batch
[2024-11-25 03:50:59.619] [info] |_ CopyInput: 0 ms/batch
[2024-11-25 03:50:59.619] [info] |_ SetInferInputDims: 0 ms/batch
[2024-11-25 03:50:59.619] [info] |_ SetOptimization: 0 ms/batch
[2024-11-25 03:50:59.619] [info] |_ SetBindingDimensions: 0 ms/batch
[2024-11-25 03:50:59.619] [info] |_ Enqueue: 9 ms/batch
[2024-11-25 03:50:59.619] [info] |_ CopyOutput: 0 ms/batch
[2024-11-25 03:50:59.619] [info] |_ Postprocess: 3 ms/batch

After I upgraded the kernel from 32.7.2 to 32.7.4, the same app with the same input had much higher latency.
The TensorRT API call “context->setBindingDimensions(0, inferInputDims);” is very slow.
Why is context->setBindingDimensions(0, inferInputDims); running so slowly? What kind of resources does it take up? What resources does it require to operate?

[2024-11-26 11:30:49.650] [info] Trt: detection inference BatchCount: 13916
[2024-11-26 11:30:49.650] [info] |_ Preprocess: 0 ms/batch
[2024-11-26 11:30:49.650] [info] |_ CopyInput: 3 ms/batch
[2024-11-26 11:30:49.650] [info] |_ SetInferInputDims: 0 ms/batch
[2024-11-26 11:30:49.650] [info] |_ **SetOptimization: 87 ms/batch**
[2024-11-26 11:30:49.650] [info] |_ SetBindingDimensions: 0 ms/batch
[2024-11-26 11:30:49.650] [info] |_ Enqueue: 21 ms/batch
[2024-11-26 11:30:49.650] [info] |_ CopyOutput: 0 ms/batch
[2024-11-26 11:30:49.650] [info] |_ Postprocess: 4 ms/batch
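
For context, a minimal sketch of what a per-batch dynamic-shape TensorRT 8.2 loop covering these profiler stages typically looks like. The function and variable names (inferOneBatch, bindings, etc.) are placeholders, not the actual demo code:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Hypothetical sketch of one profiled batch; the stage names in the comments
// match the profiler labels above, variable names are placeholders.
void inferOneBatch(nvinfer1::IExecutionContext* context,
                   cudaStream_t stream,
                   void** bindings,                    // device input/output buffers
                   const nvinfer1::Dims& inferInputDims)
{
    // "SetOptimization": select optimization profile 0 for this context.
    // This is the stage that jumped to ~87 ms/batch on r32.7.4.
    context->setOptimizationProfileAsync(0, stream);

    // "SetBindingDimensions": set the actual input shape for binding 0.
    context->setBindingDimensions(0, inferInputDims);

    // "Enqueue": launch inference asynchronously on the stream.
    context->enqueueV2(bindings, stream, nullptr);

    // Wait so the "CopyOutput"/"Postprocess" stages see finished results.
    cudaStreamSynchronize(stream);
}
```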

Hi,

Which device do you use?
AGX Orin doesn’t support r32 BSP.

Thanks.

Jetson AGX Xavier

Hi,

Is there any difference in the source you execute?

Both r32.7.2 and r32.7.4 are using TensorRT 8.2.1.
There is no difference in the TensorRT library.

Thanks.

I run the same demo with the same inputs on the same TensorRT version,
but the kernel version is not the same; I am curious whether the kernel upgrade introduced some bug that affects TensorRT inference.

Why is context->setBindingDimensions(0, inferInputDims); running so slowly? What kind of resources does it take up? What resources does it require to operate?

Hi, any comments on this?

Hi,

The function just sets a parameter and should be fast.
Based on the results you provided, we don’t see an obvious latency degradation in SetBindingDimensions.

Could you double-check it?

r32.7.2

[2024-11-25 03:50:59.619] [info] |_ SetBindingDimensions: 0 ms/batch

r32.7.4

[2024-11-26 11:30:49.650] [info] |_ SetBindingDimensions: 0 ms/batch
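
One way to double-check is to time the two calls separately on the host side. A minimal sketch (the function name and arguments are placeholders, not your demo code):

```cpp
#include <chrono>
#include <iostream>
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Measure host-side wall time of the profile selection and the shape call
// for a single batch (hypothetical names).
void timeShapeSetup(nvinfer1::IExecutionContext* context,
                    cudaStream_t stream,
                    const nvinfer1::Dims& inferInputDims)
{
    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };

    auto t0 = std::chrono::steady_clock::now();
    context->setOptimizationProfileAsync(0, stream);
    auto t1 = std::chrono::steady_clock::now();
    context->setBindingDimensions(0, inferInputDims);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "setOptimizationProfileAsync: " << us(t0, t1) << " us, "
              << "setBindingDimensions: " << us(t1, t2) << " us" << std::endl;
}
```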

Thanks.

Thanks for the reply.

Sorry, I meant that “context->setOptimizationProfileAsync(0, (*cudaStreamsArray)[threadIndex]);” is much slower on 32.7.4 than on 32.7.2.

Hi,

The API is related to the optimization profile.
Based on the documentation, the function may take time if some resource allocation is required:
https://developer.nvidia.com/docs/drive/drive-os/archives/6.0.4/tensorrt/api-reference/docs/classnvinfer1_1_1_i_execution_context.html#a74c361a3d93e70a3164988df7d60a4cc

This function will trigger layer resource updates on the next call of enqueueV2/executeV2, possibly resulting in performance bottlenecks.

Do you change the input size across different batches?
If not, this function should only be required during the initialization phase.
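
If the input shape is fixed, a minimal sketch of that pattern (select the profile and set the binding dimensions once at setup, keep only enqueueV2 in the per-batch path; all names here are hypothetical):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Hypothetical sketch: do the profile/shape setup once at init time
// instead of before every batch, when the input size never changes.
void initContext(nvinfer1::IExecutionContext* context,
                 cudaStream_t stream,
                 const nvinfer1::Dims& fixedInputDims)
{
    // Done once: associates the context with optimization profile 0.
    context->setOptimizationProfileAsync(0, stream);
    // Done once: binding dimensions only need to be reset when they change.
    context->setBindingDimensions(0, fixedInputDims);
    cudaStreamSynchronize(stream);
}

void inferBatch(nvinfer1::IExecutionContext* context,
                cudaStream_t stream,
                void** bindings)  // device buffers already filled with the input
{
    // Per batch: only enqueue; no profile/shape calls in the hot path.
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);
}
```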

Thanks.