How to boost trtexec's qps with 1 DLA only?

Hi, there:

I have an Orin32 box with 275 TOPS, 2 DLAs, and 12 CPU cores. I'm running your tool trtexec. I hope to get the best qps with only 1 DLA core, since I need to estimate the performance of your future Orin-N, which has only 1 DLA. I'm using our custom CNN model.

1) run with 2 DLA cores

With command options such as --best --useCudaGraph --allowGPUFallback --streams=8, I can achieve over 412 qps. I assume it uses both DLA cores, since I didn't specify which DLA.

So I'm hoping to get around 412/2 = 206 qps for a single DLA with the same model.

2) run with 1 DLA only:
When I tried to run with only 1 DLA, the best options were: --allowGPUFallback --useDLACore=0 --int8

But I can only achieve 112 qps, just half of the 206 above. I tried options such as --best, --useCudaGraph, and --streams, but they either crashed or gave much worse qps.

Why don't they work in single-DLA mode? Is there any way to boost single-DLA inference to half of the double-DLA number?

Many thanks.

Hi,

This looks to be a Jetson topic. I am moving it to the proper category for visibility.

Best,
Tom

Hi,

There are some misunderstandings about trtexec.
Please note that the DLA is extra hardware: Orin has one GPU + 2x DLA.

A single trtexec process runs inference on one device at a time.
To run GPU+2DLA, you will need to launch 3x trtexec for GPU, DLA0, and DLA1 respectively.

If no DLA is specified, the inference will run on GPU, which is expected to be faster.
If a DLACore is specified, the inference will run on that DLA.
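
For example, the three sessions can be launched in parallel from one shell (a minimal sketch; the engine file names and the --duration value are placeholders, and you need to build one engine per target device first):

```shell
# Sketch: run GPU + DLA0 + DLA1 concurrently, one trtexec process per device.
# model_gpu.engine / model_dla.engine are placeholder names for your engines.
/usr/src/tensorrt/bin/trtexec --loadEngine=model_gpu.engine --duration=60 &
/usr/src/tensorrt/bin/trtexec --loadEngine=model_dla.engine --useDLACore=0 --allowGPUFallback --duration=60 &
/usr/src/tensorrt/bin/trtexec --loadEngine=model_dla.engine --useDLACore=1 --allowGPUFallback --duration=60 &
wait   # block until all three sessions finish, then compare the reported qps
```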

You can boost GPU performance with the following commands (MaxN + max clocks):

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

But boosting the DLA clocks is not currently supported.
Thanks.

Thanks. Why is running on the GPU faster? Mine is a CNN model, and I had thought the DLA was better suited for it.

I tried the 3-parallel-trtexec approach, with each session lasting about 60 seconds.

The trtexec GPU session dropped to 272 qps.
Both trtexec DLA sessions dropped to 53.5 qps.

The total became 272 + 53.5*2 = 379, even worse than the GPU alone (412).
Why?

Hi,

DLA has relatively limited compute units.

To run GPU+2DLA with max performance, please apply the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ export CUDA_DEVICE_MAX_CONNECTIONS=32

Below is the Orin profiling table for your reference:

Thanks.

Thanks for the reply. But even with `CUDA_DEVICE_MAX_CONNECTIONS=32` set, the results hardly change.

The command below gets 429 qps:
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_best_nocat.engine --best --useCudaGraph --duration=60

Each of the commands below gets 104 qps:
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine --best --allowGPUFallback --useDLACore=0 --duration=60
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine --best --allowGPUFallback --useDLACore=1 --duration=60


When I ran all 3 commands at the same time in different shells:
429 --> 272
104 --> 53.5

The total is 272 + 53.5*2 = 272 + 107 = 379, way worse than 429.

I think the first command already uses DLA resources, so adding the two DLA commands makes it worse.

Hi,

No. trtexec only uses the GPU if no DLA flag is set.

But it's possible that your model cannot fully run on the DLA.
If all the models need to fall back to the GPU for some layers, the switching overhead will decrease the overall performance.

Thanks.

Even if that's true, it doesn't explain why the combined qps is worse than the GPU alone. I don't see any benefit from using the DLA cores at all. I'm confused.

I decided to try a resnet50 model today, with input size 352x672.

| tests | GPU (qps) | DLA0 (qps) | DLA1 (qps) | total | comments |
|---|---|---|---|---|---|
| separate sessions | 715 | 160 | 160 | | |
| parallel 3, run 1 | 546 | 102 | 101 | 749 | > 715: better than GPU, +34 |
| parallel 3, run 2 | 576 | 84.5 | 84.5 | 745 | > 715: better than GPU, +30 |

So 3 sessions are slightly better than the GPU-only session.

I don't know why our own model is way worse than the GPU-only session.

Hi,

The device placement depends on the layers used in your model.
When converting it to TensorRT, there are log messages showing which layers run on the DLA and which fall back to the GPU.
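
For example, one way to inspect the placement during engine build (a sketch; "model.onnx" is a placeholder for your network, and the exact log wording may vary between TensorRT versions):

```shell
# Build with DLA enabled and verbose logging, then filter the build log
# for the messages that report which layers land on the DLA vs. the GPU.
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --int8 \
    --useDLACore=0 --allowGPUFallback --verbose 2>&1 \
  | grep -iE "running on (DLA|GPU)"
```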

Thanks.

Why is the performance with 3 sessions so low even for resnet50, only about 5% better than the GPU-only session? Where is the bottleneck? Memory or OS scheduling? How can it be improved?

I tried to use nsys profile and nsys-ui, but got lost. I can't tell where the bottleneck is.
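
As a starting point, a minimal capture of one trtexec session might look like this (a sketch; the engine name is a placeholder, and the trace options listed are the common CUDA/OS ones, which may not cover DLA activity on every nsys version):

```shell
# Capture CUDA kernels, OS runtime calls, and NVTX ranges for one trtexec run,
# then open the resulting .nsys-rep file in nsys-ui to inspect the timeline.
nsys profile --trace=cuda,nvtx,osrt --output=trtexec_gpu \
    /usr/src/tensorrt/bin/trtexec --loadEngine=model_gpu.engine --duration=60
```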

Hi,

Could you test with our benchmark source below:

Orin should be able to achieve 6138.84 qps with ResNet50.
Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.