I have an Orin32 box with 275 TOPS, 2 DLA cores, and 12 CPU cores, and I'm benchmarking with your tool trtexec. I'd like to get the best qps with only 1 DLA core, since I need to estimate the performance of your future Orin-N, which has only 1 DLA. I'm using our custom CNN model.
1) run with 2 DLA cores
With command options such as --best --useCudaGraph --allowGPUFallback --streams=8, I can achieve over 412 qps. I assume it uses both DLA cores, since I didn't specify which DLA.
So I'm hoping to get around 412/2 = 206 qps on a single DLA for the same model.
2) run with 1 DLA only:
When I tried to run with only 1 DLA, the best option set was: --allowGPUFallback --useDLACore=0 --int8
But I can only achieve 112 qps, only about half of the 206 above. I tried options such as --best, --useCudaGraph, and --streams, but they either crashed or gave much worse qps.
Why don't they work in single-DLA mode? Is there any way to boost single-DLA inference to half of the dual-DLA throughput?
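For reference, what I'd expect to work is pinning one trtexec instance to each DLA core and running them concurrently, then summing the two qps numbers to estimate the dual-DLA total. A sketch using my engine file and the flags above:

```
# Run one trtexec instance per DLA core in parallel; each instance's qps
# should then approximate single-DLA throughput under contention.
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine \
    --int8 --allowGPUFallback --useDLACore=0 --duration=60 &
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine \
    --int8 --allowGPUFallback --useDLACore=1 --duration=60 &
wait
```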
Thanks for the reply. But even with `CUDA_DEVICE_MAX_CONNECTIONS=32` set, the command below gets 429 qps:

```
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_best_nocat.engine --best --useCudaGraph --duration=60
```
Each of the commands below gets 104 qps:

```
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine --best --allowGPUFallback --useDLACore=0 --duration=60
/usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine --best --allowGPUFallback --useDLACore=1 --duration=60
```
When I ran all 3 commands at the same time in different shells, the throughput dropped:

429 --> 272
104 --> 53.5 (each)

The total is 272 + 53.5×2 = 272 + 107 = 379 qps, far worse than the 429 from the GPU session alone.
I think the first command already uses DLA resources, so adding the two DLA commands makes everything worse.
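To make the arithmetic explicit, the totals work out like this (a quick Python check, nothing trtexec-specific):

```python
# Throughput of each session alone vs. with all three running concurrently.
gpu_alone = 429.0    # qps, GPU-only session run by itself
gpu_shared = 272.0   # qps, GPU session when all three run together
dla_shared = 53.5    # qps, each of the two DLA sessions when shared

total_shared = gpu_shared + 2 * dla_shared
print(total_shared)                          # 379.0
print(round(total_shared / gpu_alone, 2))    # 0.88 -> concurrency loses ~12%
```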
No. trtexec only uses the GPU when no DLA flag is set.
But it's possible that your model cannot run fully on DLA.
If all of the models need the GPU for part of their inference, the switching overhead will decrease performance.
The device placement depends on the layers used in your model.
When converting it to TensorRT, there are log messages that show which layers run on DLA and which on the GPU.
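If it helps, you can count the placements from the build log with something like the sketch below. The "[DlaLayer]"/"[GpuLayer]" markers are an assumption; the exact wording differs between TensorRT versions, so match them against your own log.

```python
# Count how many layers TensorRT assigned to DLA vs. GPU in a trtexec build log.
# The "[DlaLayer]" / "[GpuLayer]" markers are assumed; adjust to your log format.
def placement_summary(log_text):
    lines = log_text.splitlines()
    return {
        "dla": sum("[DlaLayer]" in ln for ln in lines),
        "gpu": sum("[GpuLayer]" in ln for ln in lines),
    }

# Hypothetical log excerpt for illustration only.
sample = """\
[I] [TRT] [DlaLayer] conv1 + relu1
[I] [TRT] [DlaLayer] conv2 + relu2
[I] [TRT] [GpuLayer] softmax
"""
print(placement_summary(sample))  # {'dla': 2, 'gpu': 1}
```

If the GPU count is nonzero, that is where the DLA-to-GPU switching overhead comes from.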
Why is the performance with 3 sessions so low even for ResNet-50, only 5% better than the GPU-only session? Where is the bottleneck, memory bandwidth or OS scheduling? How can I improve it?
I tried nsys profile and nsys-ui, but got lost. I can't tell where the bottleneck is.
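For reference, this is roughly how I invoked nsys (a sketch; I'm not sure I'm tracing the right domains to see DLA activity):

```
# Profile one single-DLA trtexec run, then dump summary tables from the report.
nsys profile --trace=cuda,nvtx,osrt -o dla_run \
    /usr/src/tensorrt/bin/trtexec --loadEngine=c3d_dla.engine \
    --best --allowGPUFallback --useDLACore=0 --duration=60
nsys stats dla_run.nsys-rep
```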