[TensorRT] Model inference speed reduction on Jetson Xavier AGX when using 2 models

Description

Dear forum developers,

I have two models: one for object detection and one for segmentation.
Each model has its own pre-processing because their input shapes are different.

I’ve serialized and quantized both models to INT8 following NVIDIA’s sample code, and they run successfully.
The INT8 quantized engines are much faster than the FP32/FP16 ones.

The problem happens when I try to run both models at the same time.

When run individually, model A takes about 3 ms and model B about 4 ms. When they run simultaneously, the inference times increase to about 6 ms (model A) and 10 ms (model B), respectively.
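
For reference, the way I run the two models simultaneously is roughly like the sketch below (heavily simplified; engine deserialization, buffer allocation, and error handling are omitted, and the function and variable names are placeholders):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <thread>

// One inference on a dedicated execution context and stream.
static void runOnce(nvinfer1::IExecutionContext* ctx, void** bindings, cudaStream_t stream)
{
    ctx->enqueueV2(bindings, stream, nullptr);  // asynchronous launch on this model's stream
    cudaStreamSynchronize(stream);              // wait for this model only
}

// "Simultaneous mode": both models are enqueued at the same time from separate host threads.
void runBothConcurrently(nvinfer1::IExecutionContext* ctxA, void** bindingsA, cudaStream_t streamA,
                         nvinfer1::IExecutionContext* ctxB, void** bindingsB, cudaStream_t streamB)
{
    std::thread tA(runOnce, ctxA, bindingsA, streamA);
    std::thread tB(runOnce, ctxB, bindingsB, streamB);
    tA.join();
    tB.join();
}
```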

I don’t understand what is happening here.
The pre-processing uses OpenCV-CUDA and the post-processing uses OpenGL, both of which also run on the GPU.
Could that be the problem?
I tried turning off the pre/post-processing; the inference time changed only slightly and was still much slower than single-model inference.
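
For context, the GPU pre-processing for each model is roughly like this (a rough sketch; the target size and normalization are placeholders for my actual pipeline):

```cpp
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/imgproc.hpp>

// Upload, resize to the model's input shape, and normalize, all on the GPU.
void preprocess(const cv::Mat& frame, cv::cuda::GpuMat& netInput, cv::cuda::Stream& stream)
{
    cv::cuda::GpuMat gpuFrame, resized;
    gpuFrame.upload(frame, stream);                             // host -> device copy
    cv::cuda::resize(gpuFrame, resized, cv::Size(640, 640),     // placeholder input shape
                     0, 0, cv::INTER_LINEAR, stream);
    resized.convertTo(netInput, CV_32FC3, 1.0 / 255.0, 0.0, stream);  // scale to [0, 1]
}
```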

Please give me some advice.

Thanks.

Environment

TensorRT Version: TensorRT 8
GPU Type: NVIDIA Jetson Xavier AGX
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.4
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

I think the Jetson team will be in a better position to help you, so I have moved this topic and added some tags.


Hi,

Could you double-check the CUDA and cuDNN versions?
For Jetson, CUDA 11.4 is not available yet.

If each model occupies all of the GPU resources by itself to reach 3 ms (model A) and 4 ms (model B), then when you deploy them concurrently they need to share the GPU resources, and there may also be some switching overhead.
The result of 6 ms (model A) and 10 ms (model B) seems acceptable.
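
One thing you could try is to wrap each enqueue with CUDA events and compare the measured GPU time in single mode and in concurrent mode; if the event-measured time also roughly doubles, the extra latency comes from GPU-side contention rather than host-side overhead. A rough sketch (function and variable names are placeholders):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdio>

// Measure the GPU time of one inference on its own stream with CUDA events.
float timedInference(nvinfer1::IExecutionContext* ctx, void** bindings, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);           // recorded when the stream reaches this point
    ctx->enqueueV2(bindings, stream, nullptr);
    cudaEventRecord(stop, stream);            // recorded when this model's kernels are done
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time between the two events
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```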

Thanks.

You are right, the CUDA and cuDNN versions I listed are not correct.
(JetPack 4.6 is installed on the device.)

Could you explain why the inference speed slows down by roughly 2x?

Thanks.

Hi,

As mentioned above, when two models run concurrently, they need to share the GPU resources.
This is why the inference time of each model increases.
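
If you do not strictly need the two models to overlap, you could also enqueue them back-to-back on a single stream so that each inference has the whole GPU to itself; the total latency should then be roughly the sum of the single-model times instead of both being inflated. A rough sketch with placeholder names:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Run the two engines sequentially on one stream so they do not compete for the GPU.
void runBothSequentially(nvinfer1::IExecutionContext* ctxA, void** bindingsA,
                         nvinfer1::IExecutionContext* ctxB, void** bindingsB,
                         cudaStream_t stream)
{
    ctxA->enqueueV2(bindingsA, stream, nullptr);  // model A runs first
    ctxB->enqueueV2(bindingsB, stream, nullptr);  // model B starts after A finishes on the GPU
    cudaStreamSynchronize(stream);                // wait for both
}
```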

Thanks.
