Yolov3 on Xavier is slower than on Xavier NX

Description

Hello,

I have deployed a Yolov3 (ONNX) model on Xavier (JetPack 4.2.2) and on Xavier NX (JetPack 4.4), but the model runs faster on Xavier NX than on Xavier.

After some testing, I found that on Xavier the function
void cuPointwise::launchPointwise<cuPointwise::SimpleAlgo<char, int>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram) takes the most time.

On Xavier NX, however, this function is never invoked.

I also tested another model, HigherHRNet (ONNX), and it does not call void cuPointwise::launchPointwise<cuPointwise::SimpleAlgo<char, int>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram) on Xavier.

Any ideas?

Environment

TensorRT Version: Xavier: TensorRT5.1, XavierNX: TensorRT7.1
CUDA Version: Xavier: 10.0, XavierNX: 10.2
CUDNN Version: Xavier: 7.5, XavierNX: 8.0.0

Relevant Files

Xavier

XavierNX

Steps To Reproduce

I tested with trtexec, using the following commands:
Xavier:

./trtexec --onnx=/home/ets/Documents/yolov3/yolov3_bn16_m.onnx  --loadEngine=/home/ets/Documents/yolov3/yolov3_bn16_int8_m.engine --workspace=4096 --int8 --fp16 --batch=16 

XavierNX:

./trtexec --onnx=/home/ets/Documents/yolov3/yolov3_bn16_m.onnx --loadEngine=/home/ets/Documents/yolov3/yolov3_bn16_in8t.engine --explicitBatch --workspace=4096 --fp16 --int8 --batch=16 --verbose
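For anyone trying to reproduce the per-kernel timing described above, one possible way to collect it is to wrap trtexec in nvprof. This is only a sketch: it assumes nvprof from the CUDA toolkit is on the PATH, that trtexec is in the current directory, and it reuses the engine path and batch size from the Xavier command above. The post does not say which profiler was actually used. By default the function only prints the command; set RUN=1 on the device to execute it.

```shell
#!/bin/sh
# Sketch: build an nvprof command that prints the per-kernel GPU time summary
# for a trtexec run. Assumes nvprof (CUDA toolkit) is on PATH and ./trtexec
# exists in the current directory.

profile() {
  # $1 = path to the serialized engine, $2 = batch size
  set -- nvprof --print-gpu-summary ./trtexec "--loadEngine=$1" "--batch=$2"
  # Always show the command so it can be copied or checked first.
  echo "$*"
  # Only execute when RUN=1, so the script is harmless off-device.
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

# Defaults copied from the Xavier command in this post; override via environment.
ENGINE="${ENGINE:-/home/ets/Documents/yolov3/yolov3_bn16_int8_m.engine}"
BATCH="${BATCH:-16}"

profile "$ENGINE" "$BATCH"
```

Running the same script on both boards and comparing the top entries of the two summaries should show whether the launchPointwise kernel appears only on Xavier.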

Hi @lingchao.zhu,
The Jetson Xavier team will be able to help you better here, so I am moving your query to the respective forum.
Thanks!

Hi,

Each TensorRT release usually brings some performance improvements,
so the speedup on NX may come from the newer TensorRT version.

Would you mind running the same test on Xavier with JetPack 4.4 first?

Thanks.
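
Before rerunning the comparison, it may help to confirm that both boards really report the same software versions. The following is a minimal sketch, assuming a standard JetPack install where the L4T release is recorded in /etc/nv_tegra_release and TensorRT and cuDNN are installed as Debian packages whose names contain libnvinfer and libcudnn. The function name show_versions is just an illustrative choice.

```shell
#!/bin/sh
# Sketch: print the L4T release and the installed TensorRT / cuDNN packages.
# Assumes a JetPack-style install; falls back to a message elsewhere.

show_versions() {
  echo "== L4T release =="
  if [ -r /etc/nv_tegra_release ]; then
    # The first line of this file carries the release and revision numbers.
    head -n 1 /etc/nv_tegra_release
  else
    echo "not a Jetson (no /etc/nv_tegra_release)"
  fi

  echo "== TensorRT / cuDNN packages =="
  if command -v dpkg >/dev/null 2>&1; then
    # List installed packages and keep only the TensorRT and cuDNN entries.
    dpkg -l | grep -E 'libnvinfer|libcudnn' || echo "none found"
  else
    echo "dpkg not available"
  fi
}

show_versions
```

If the two boards print the same package versions and the timing gap remains, the difference is less likely to come from the TensorRT release alone.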