DLA / GPU question

I have a few questions about the DLA / GPU feature.
If I understood correctly, when I use IBuilder’s setDefaultDeviceType I make the network run either on the DLA, which is the actual Tensor Cores special hardware, or on the GPU, which is the “regular” CUDA cores.
Is that right?

Suppose I want to run three different networks at the same time. Can I create three different builders/execution engines on DLA 0, DLA 1 and the GPU and have them run concurrently, on different CUDA streams?
If it’s possible, how do I set this up? I see that setDLACore is an IRuntime method. Can I have three different IRuntime objects? How do I connect each network to a different runtime/hardware?

Hope this makes sense :)



1. You can check the spec below. DLA is a separate piece of hardware; it is not part of the GPU.

2. Yes, you can run DLA0, DLA1 and GPU at the same time.

3. Just creating 3 independent TensorRT engines will be fine.


Regarding 2 - so I might see a 3x speedup (or whatever) compared to running serially on only one of them?
Regarding 3 - what do you mean by creating 3 independent TensorRT? Which objects? In the same application?

BTW - Since you’ve answered so many of my questions, I’d be happy to know your name so I can thank you more personally :)



It’s expected that the DLA runs slower than the GPU, and its complexity is also lower.
In the same application, you can create multiple TensorRT engines like this:

ICudaEngine* engine1 = builder->buildEngineWithConfig(*network1, *config);
ICudaEngine* engine2 = builder->buildEngineWithConfig(*network2, *config);
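
The two lines above share one config, so both engines would target the same device. A minimal sketch of building one engine per device and enqueueing them on separate CUDA streams follows. It assumes a TensorRT release that has IBuilderConfig and buildEngineWithConfig (TensorRT 6 or 7), networks that use implicit batch, and builder and network objects that already exist. The helper names buildFor and runConcurrently, the workspace size, and the bindings arrays are made up for illustration.

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

using namespace nvinfer1;

// Build an engine for one network.
// dlaCore >= 0 targets that DLA core; dlaCore < 0 keeps the default device (GPU).
ICudaEngine* buildFor(IBuilder& builder, INetworkDefinition& network, int dlaCore)
{
    IBuilderConfig* config = builder.createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 28); // 256 MiB, an arbitrary choice
    if (dlaCore >= 0) {
        config->setDefaultDeviceType(DeviceType::kDLA);
        config->setDLACore(dlaCore);
        config->setFlag(BuilderFlag::kFP16);         // DLA runs FP16 or INT8, not FP32
        config->setFlag(BuilderFlag::kGPU_FALLBACK); // unsupported layers go to the GPU
    }
    ICudaEngine* engine = builder.buildEngineWithConfig(network, *config);
    config->destroy();
    return engine;
}

// Enqueue three engines on three streams, then wait for all of them.
// bindings[i] is the caller-allocated device buffer array for engine i.
void runConcurrently(ICudaEngine* engines[3], void** bindings[3], int batchSize)
{
    IExecutionContext* contexts[3];
    cudaStream_t streams[3];
    for (int i = 0; i < 3; ++i) {
        contexts[i] = engines[i]->createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }
    for (int i = 0; i < 3; ++i)
        contexts[i]->enqueue(batchSize, bindings[i], streams[i], nullptr);
    for (int i = 0; i < 3; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        contexts[i]->destroy();
    }
}
```

Usage would look like buildFor(*builder, *network1, 0), buildFor(*builder, *network2, 1) and buildFor(*builder, *network3, -1) for DLA 0, DLA 1 and the GPU. Whether the three enqueues really overlap depends on the TensorRT version and the platform; driving each engine from its own host thread is a common alternative. On TensorRT 5, which predates IBuilderConfig, the equivalent settings should be on IBuilder itself (setDefaultDeviceType, setDLACore, allowGPUFallback, setFp16Mode) with buildCudaEngine doing the build. IRuntime::setDLACore is only needed when deserializing a saved engine.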


So a few more followups please.
I’m running a single test convolution on both DLA and GPU. These are the results I see for a 3x512x512 operation.


Build network:
--------------- Layers running on DLA:

--------------- Layers running on GPU:
(Unnamed Layer* 0) [Convolution],

IProfiler Output:
Layer [5]: [(Unnamed Layer* 0) [Convolution]]: 2.68947ms

GPU activities: 100.00% 286.06ms 100 2.8606ms 2.5736ms 3.7559ms void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1>, float, float, int=3, int=4, int=1, int=7, int=7, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
API calls: 88.05% 2.20135s 100 22.013ms 33.216us 2.19461s cudaLaunchKernel

Build network:
--------------- Layers running on DLA:
(Unnamed Layer* 0) [Convolution],
--------------- Layers running on GPU:

IProfiler output:
Layer [6]: [input to nvm]: 0.246752ms
Layer [7]: [{(Unnamed Layer* 0) [Convolution]}]: 0.280928ms
Layer [8]: [input copy finish]: 0.063328ms
Layer [9]: [output from nvm]: 1.35133ms
Layer [10]: [output copy finish]: 0.00384ms

==11410== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 55.06% 11.445ms 100 114.45us 112.29us 119.08us void genericReformat::copyPackedKernel<float, __half, bool=1, bool=1, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const , void, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
44.94% 9.3408ms 100 93.408us 76.386us 779.86us void genericReformat::copyPackedKernel<__half, float, bool=1, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const , void, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
API calls: 94.25% 2.37585s 200 11.879ms 33.088us 2.36305s cudaLaunchKernel

  1. Why does the GPU use a fusedConvolutionReluKernel and not cudnnXXXX?
  2. What are all those layers (input to nvm, input copy finish, output from nvm, output copy finish)? Are they some sort of overhead when working with the DLA? Can they be avoided?
  3. Why do you say that the DLA (the special hardware, Tensor Cores) would run slower than the GPU?
  4. Regarding buildEngineWithConfig - I’m using TensorRT 5. I guess this can still be done by creating two builders, networks and engines?
  5. Are the answers to these questions somewhere in the documentation? I’ve looked through the docs and samples but could not find anything about performance tuning, or the kind of in-depth explanations that are available for CUDA.

Thanks a lot


Sorry for the late update.

1. fusedConvolutionReluKernel is a TensorRT kernel, which fuses the ReLU and convolution operations for performance.

TensorRT does more than just reuse the cuDNN implementations.

2. Yes, they are overhead of working with the DLA.
The DLA uses the GPU to move memory to nvm, so you will see that work in the GPU section.

3. DLA is not the Tensor Core hardware. It is a separate hardware block on Xavier.
Tensor Cores are part of the Xavier GPU.

4. You can find more details in our documentation:

5. Please see this section for performance tuning.