How to boost trtexec's qps with 1 DLA only?

Hi, there:

I have an Orin32 box with 275 TOPS, 2 DLAs, and 12 CPU cores. I'm running your tool trtexec. I hope to get the best qps with only 1 DLA core, since I need to estimate the performance of your future Orin-N, which has only 1 DLA. I'm using our custom CNN model.

1) run with 2 DLA cores

With command options such as --best --useCudaGraph --allowGPUFallback --streams=8, I can achieve over 412 qps. I assume it uses both DLA cores, since I didn't specify which DLA.

So I'm hoping to get around 412/2 = 206 qps for a single DLA with the same model.

2) run with 1 DLA only:
When I tried to run with only 1 DLA, the best options were: --allowGPUFallback --useDLACore=0 --int8

But I can only achieve 112 qps, just half of the 206 above. I tried options such as --best, --useCudaGraph, and --streams, but they either crashed or gave much worse qps.

Why don't they work in single-DLA mode? Is there any way to boost single-DLA inference to half of the double-DLA number?

Many thanks.

Hi,

This looks to be a Jetson topic. I am moving it to the proper category for visibility.

Best,
Tom

Hi,

There are some misunderstandings about trtexec.
Please note that the DLA is extra hardware: Orin has one GPU + 2x DLA.

A single trtexec process runs inference on one device at a time.
To run GPU+2DLA, you will need to launch 3x trtexec for GPU, DLA0, and DLA1 respectively.

If no DLA is specified, the inference will run on GPU, which is expected to be faster.
If a DLACore is specified, the inference will run on that DLA.
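
For example, the three sessions can be launched in parallel from one shell (a minimal sketch; the engine file names and the --duration value are placeholders, and you need to build one engine per target device first):

```shell
# Sketch: run GPU + DLA0 + DLA1 concurrently, one trtexec process per device.
# model_gpu.engine / model_dla.engine are placeholder names for your engines.
/usr/src/tensorrt/bin/trtexec --loadEngine=model_gpu.engine --duration=60 &
/usr/src/tensorrt/bin/trtexec --loadEngine=model_dla.engine --useDLACore=0 --allowGPUFallback --duration=60 &
/usr/src/tensorrt/bin/trtexec --loadEngine=model_dla.engine --useDLACore=1 --allowGPUFallback --duration=60 &
wait   # block until all three sessions finish, then compare the reported qps
```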

You can boost GPU performance with the following commands (MaxN + max clocks):

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

But boosting the DLA clocks is not currently supported.
Thanks.

Thanks. Why is running on the GPU faster? Mine is a CNN model, and I had thought the DLA was better suited for it.

I tried the 3-parallel-trtexec approach, with each session lasting about 60 seconds.

The trtexec GPU session dropped to 272 qps.
Both trtexec DLA sessions dropped to 53.5 qps.

The total became 272 + 53.5*2 = 379, even worse than the GPU alone (412).
Why?

Hi,

DLA has relatively limited compute units.

To run GPU+2DLA with max performance, please apply the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ export CUDA_DEVICE_MAX_CONNECTIONS=32

Below is the Orin profiling table for your reference:

Thanks.

Thanks for the reply. But even with `CUDA_DEVICE_MAX_CONNECTIONS=32` set, the results hardly change.

The command below gets 429 qps:
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_best_nocat.engine --best --useCudaGraph --duration=60

Each of the commands below gets 104 qps:
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine --best --allowGPUFallback --useDLACore=0 --duration=60
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine --best --allowGPUFallback --useDLACore=1 --duration=60


When I ran all 3 commands at the same time in different shells:
429 --> 272
104 --> 53.5

The total is 272 + 53.5*2 = 272 + 107 = 379, way worse than 429.

I think the first command already uses DLA resources, so adding the two DLA commands makes it worse.

Hi,

No. trtexec only uses the GPU if no DLA flag is set.

But it's possible that your model cannot fully run on the DLA.
If all the models need to fall back to the GPU for some layers, the switching overhead will decrease the overall performance.

Thanks.

Even if that's true, it doesn't explain why the combined qps is worse than the GPU alone. I don't see any benefit from using the DLA cores at all. I'm confused.

I decided to try a resnet50 model today, with input size 352x672.

| tests | GPU (qps) | DLA0 (qps) | DLA1 (qps) | total | comments |
|---|---|---|---|---|---|
| separate sessions | 715 | 160 | 160 | | |
| parallel 3, run 1 | 546 | 102 | 101 | 749 | > 715: better than GPU, +34 |
| parallel 3, run 2 | 576 | 84.5 | 84.5 | 745 | > 715: better than GPU, +30 |

So 3 sessions are slightly better than the GPU-only session.

I don't know why our own model is way worse than the GPU-only session.

Hi,

The device placement depends on the layers used in your model.
When converting it to TensorRT, there are log messages showing which layers run on the DLA and which fall back to the GPU.
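
For example, one way to inspect the placement during engine build (a sketch; "model.onnx" is a placeholder for your network, and the exact log wording may vary between TensorRT versions):

```shell
# Build with DLA enabled and verbose logging, then filter the build log
# for the messages that report which layers land on the DLA vs. the GPU.
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --int8 \
    --useDLACore=0 --allowGPUFallback --verbose 2>&1 \
  | grep -iE "running on (DLA|GPU)"
```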

Thanks.

Why is the performance with 3 sessions so low even for resnet50, only about 5% better than the GPU-only session? Where is the bottleneck? Memory or OS scheduling? How can it be improved?

I tried to use nsys profile and nsys-ui, but got lost. I can't tell where the bottleneck is.
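
As a starting point, a minimal capture of one trtexec session might look like this (a sketch; the engine name is a placeholder, and the trace options listed are the common CUDA/OS ones, which may not cover DLA activity on every nsys version):

```shell
# Capture CUDA kernels, OS runtime calls, and NVTX ranges for one trtexec run,
# then open the resulting .nsys-rep file in nsys-ui to inspect the timeline.
nsys profile --trace=cuda,nvtx,osrt --output=trtexec_gpu \
    /usr/src/tensorrt/bin/trtexec --loadEngine=model_gpu.engine --duration=60
```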

Hi,

Could you test with our benchmark source below:

Orin should be able to achieve 6138.84 qps with ResNet50.
Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.