DLA / GPU question

eyalhir74 · May 6, 2020, 6:39am

Hi,
I have a few questions about the DLA / GPU feature.
If I understood correctly, when I use the IBuilder’s setDefaultDeviceType, the network runs either on the DLA, which is the actual tensor core special hardware, or on the GPU, which is the “regular” CUDA cores.
Is that right?

Suppose I want to run three different networks at the same time. Can I create three different builders/execution engines on DLA 0, DLA 1 and the GPU and have them run at the same time, on different CUDA streams?
If it’s possible, how do I set this up? I see that setDLACore is an IRuntime method. Can I have three different IRuntime objects? How do I connect each network to a different runtime/hardware?

Hope this makes sense :)

thanks
Eyal

AastaLLL · May 6, 2020, 7:09am

Hi,

1. You can check the spec below. The DLA is separate hardware; it is not the GPU.
https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit

2. Yes, you can run DLA0, DLA1 and GPU at the same time.

3. Creating 3 independent TensorRT instances will be fine.

Thanks.

eyalhir74 · May 6, 2020, 7:13am

Thanks.
Regarding 2 - so I might see up to a 3x speedup (or whatever) compared to running serially on only one of them?
Regarding 3 - what do you mean by creating 3 independent TensorRT instances? Which objects? In the same application?

BTW - since you’ve answered so many of my questions, I’d be happy to know your name so I can thank you more personally :)

thanks
Eyal

AastaLLL · May 7, 2020, 2:02am

Hi,

The DLA is expected to run slower than the GPU, and the complexity it can handle is also lower.
In the same application, you can create multiple TensorRT engines like this:

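// Note: to target different devices, each engine would need its own config (device type / DLA core).
ICudaEngine* engine1 = builder->buildEngineWithConfig(*network1, *config);
ICudaEngine* engine2 = builder->buildEngineWithConfig(*network2, *config);
...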
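
As a rough illustration of why a slower DLA means less than a 3x gain, here is a minimal sketch that estimates the best-case speedup from running several devices concurrently. It assumes the devices do not contend for memory bandwidth or anything else, which is an idealization. The function names are made up for this example and are not part of TensorRT.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inferences per second for one device, given its latency in milliseconds.
double rate_per_second(double latency_ms) {
    assert(latency_ms > 0.0);
    return 1000.0 / latency_ms;
}

// Best-case speedup from running all devices concurrently, relative to
// using only the fastest one. Assumes no contention between devices
// (an idealization; real numbers have to be measured).
double ideal_speedup(const std::vector<double>& latencies_ms) {
    assert(!latencies_ms.empty());
    double total = 0.0;  // combined rate of all devices
    double best = 0.0;   // rate of the fastest single device
    for (double ms : latencies_ms) {
        const double r = rate_per_second(ms);
        total += r;
        best = std::max(best, r);
    }
    return total / best;
}
```

With hypothetical latencies of 2 ms on the GPU and 4 ms on each DLA, `ideal_speedup({2.0, 4.0, 4.0})` gives 2, not 3. The gain only approaches 3x when all three devices are equally fast.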

Thanks.

eyalhir74 · May 7, 2020, 4:48am

Hi,
So a few more follow-ups, please.
I’m running a single test convolution on both the DLA and the GPU. These are the results I see for a 3x512x512 operation.


GPU:

Build network:
--------------- Layers running on DLA:
--------------- Layers running on GPU:
(Unnamed Layer* 0) [Convolution],

IProfiler output:
Layer [5]: [(Unnamed Layer* 0) [Convolution]]: 2.68947ms

NVProf:
GPU activities:  100.00%  286.06ms       100  2.8606ms  2.5736ms  3.7559ms  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1>, float, float, int=3, int=4, int=1, int=7, int=7, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
API calls:   88.05%  2.20135s       100  22.013ms  33.216us  2.19461s  cudaLaunchKernel

DLA:

Build network:
--------------- Layers running on DLA:
(Unnamed Layer* 0) [Convolution],
--------------- Layers running on GPU:

IProfiler output:
Layer [6]: [input to nvm]: 0.246752ms
Layer [7]: [{(Unnamed Layer* 0) [Convolution]}]: 0.280928ms
Layer [8]: [input copy finish]: 0.063328ms
Layer [9]: [output from nvm]: 1.35133ms
Layer [10]: [output copy finish]: 0.00384ms

NVProf:
==11410== Profiling result:
Type  Time(%)  Time  Calls  Avg  Min  Max  Name
GPU activities:  55.06%  11.445ms  100  114.45us  112.29us  119.08us  void genericReformat::copyPackedKernel<float, __half, bool=1, bool=1, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const , void, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
44.94%  9.3408ms  100  93.408us  76.386us  779.86us  void genericReformat::copyPackedKernel<__half, float, bool=1, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const , void, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
API calls:  94.25%  2.37585s  200  11.879ms  33.088us  2.36305s  cudaLaunchKernel
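
For a rough side-by-side, the DLA IProfiler entries above can be added up and compared with the single GPU entry (2.68947 ms). This is a minimal sketch that assumes the per-entry times do not overlap and simply sum to the end-to-end latency. Treating every non-convolution entry as transfer overhead is also an assumption, and the struct and function names are made up for this example.

```cpp
#include <cassert>
#include <vector>

struct LayerTime {
    const char* name;
    double ms;
    bool is_compute;  // true only for the convolution itself
};

// IProfiler values copied from the DLA run in this post.
std::vector<LayerTime> dla_profile() {
    return {
        {"input to nvm", 0.246752, false},
        {"(Unnamed Layer* 0) [Convolution]", 0.280928, true},
        {"input copy finish", 0.063328, false},
        {"output from nvm", 1.35133, false},
        {"output copy finish", 0.00384, false},
    };
}

// Sum of all entries, assuming they run back to back without overlap.
double total_ms(const std::vector<LayerTime>& layers) {
    double sum = 0.0;
    for (const LayerTime& l : layers) sum += l.ms;
    return sum;
}

// Share of the total spent in entries that are not the convolution.
double overhead_fraction(const std::vector<LayerTime>& layers) {
    assert(!layers.empty());
    double overhead = 0.0;
    for (const LayerTime& l : layers) {
        if (!l.is_compute) overhead += l.ms;
    }
    return overhead / total_ms(layers);
}
```

Under those assumptions, `total_ms(dla_profile())` comes to about 1.95 ms, and `overhead_fraction(dla_profile())` is roughly 0.86. So most of the DLA time in this test goes to the nvm/copy entries rather than to the convolution.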

1. Why does the GPU use a fusedConvolutionReluKernel and not cudnnXXXX?
2. What are all those layers (input to nvm, input copy finish, output from nvm, output copy finish)? Are they some sort of overhead when working with the DLA? Can they be avoided?
3. Why do you say that the DLA (the special Tensor Core hardware) would run slower than the GPU?
4. Regarding buildEngineWithConfig - I’m using TensorRT 5. I guess this can still be done by creating two builders, networks and engines?
5. Are the answers to these questions somewhere in the documentation? I’ve looked through the docs and samples but could not find anything about performance tuning, or the kind of in-depth explanations that are available for CUDA.

Thanks a lot
Eyal

AastaLLL · May 20, 2020, 8:19am

Hi,

Sorry for the late update.

1. fusedConvolutionReluKernel is a TensorRT function that fuses the ReLU and convolution operations for performance.

TensorRT does more than just reuse the cuDNN implementations.

2. Yes.
The DLA uses the GPU to move the memory to nvm, so you will see that in the GPU section.

3. The DLA is not the Tensor Core hardware. It is separate hardware on Xavier.
The Tensor Cores are part of the Xavier GPU.

4. You can find more details in our documentation:

5. Please find this section for the performance tuning.
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html
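
To put item 2 in numbers, the nvprof totals posted earlier in the thread can be divided by their call counts. This is a minimal sketch. It assumes the two copyPackedKernel entries (which, judging by their float/__half template arguments, appear to be the format conversions around the DLA) are the only GPU work in the DLA run, as the listed percentages suggest. The function names are made up for this example.

```cpp
#include <cassert>

// Average time per call in ms, from an nvprof total (ms) and call count.
double per_call_ms(double total_ms, int calls) {
    assert(calls > 0);
    return total_ms / calls;
}

// GPU run: one fused convolution kernel, 286.06 ms over 100 calls.
double gpu_run_ms() {
    return per_call_ms(286.06, 100);
}

// DLA run: two reformat kernels on the GPU per inference,
// 11.445 ms and 9.3408 ms over 100 calls each.
double dla_run_gpu_ms() {
    return per_call_ms(11.445, 100) + per_call_ms(9.3408, 100);
}
```

That gives about 2.86 ms of GPU time per inference in the GPU run, against about 0.21 ms in the DLA run. In this test the GPU is therefore mostly idle while the DLA handles the convolution.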

Thanks.
