Hi,
A few more follow-ups, please.
I’m running a single test convolution on both the DLA and the GPU. These are the results I see for a 3x512x512 convolution.
GPU:
Build network:
--------------- Layers running on DLA:
--------------- Layers running on GPU:
(Unnamed Layer* 0) [Convolution],
IProfiler Output:
Layer [5]: [(Unnamed Layer* 0) [Convolution]]: 2.68947ms
NVProf:
GPU activities: 100.00% 286.06ms 100 2.8606ms 2.5736ms 3.7559ms void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1>, float, float, int=3, int=4, int=1, int=7, int=7, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
API calls: 88.05% 2.20135s 100 22.013ms 33.216us 2.19461s cudaLaunchKernel
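(For reference, the per-layer timings above come from a minimal IProfiler along these lines; the class name and output stream are my own choices:)

```cpp
#include <iostream>
#include "NvInfer.h"

// Sketch of the profiler behind the per-layer timings. Attach it with
// context->setProfiler(&profiler) and use the synchronous execute() path,
// since IProfiler does not report timings for enqueue() in TensorRT 5.
struct LayerProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::cout << layerName << ": " << ms << " ms\n";
    }
};
```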
DLA:
Build network:
--------------- Layers running on DLA:
(Unnamed Layer* 0) [Convolution],
--------------- Layers running on GPU:
IProfiler output:
Layer [6]: [input to nvm]: 0.246752ms
Layer [7]: [{(Unnamed Layer* 0) [Convolution]}]: 0.280928ms
Layer [8]: [input copy finish]: 0.063328ms
Layer [9]: [output from nvm]: 1.35133ms
Layer [10]: [output copy finish]: 0.00384ms
NVProf:
==11410== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 55.06% 11.445ms 100 114.45us 112.29us 119.08us void genericReformat::copyPackedKernel<float, __half, bool=1, bool=1, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const , void, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
44.94% 9.3408ms 100 93.408us 76.386us 779.86us void genericReformat::copyPackedKernel<__half, float, bool=1, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const , void, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
API calls: 94.25% 2.37585s 200 11.879ms 33.088us 2.36305s cudaLaunchKernel
- Why does the GPU use a fusedConvolutionReluKernel rather than a cuDNN kernel (cudnnXXXX)?
- What are all those extra layers (input to nvm, input copy finish, output from nvm, output copy finish)? Are they some sort of overhead when working with the DLA? Can they be avoided?
- Why do you say that the DLA (the dedicated deep-learning accelerator hardware) would run slower than the GPU?
- Regarding buildEngineWithConfig: I’m using TensorRT 5, where that API isn’t available. I assume the same thing can still be done by creating two separate builders, networks, and engines?
- Are the answers to these questions somewhere in the documentation? I’ve looked through the docs and samples but could not find anything on performance tuning or the kind of in-depth explanations that are available for CUDA.
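For the last point about two builders, here is roughly what I have in mind (a sketch against the TensorRT 5 C++ API; the helper name and the network-population step are placeholders, not my actual code):

```cpp
#include "NvInfer.h"

using namespace nvinfer1;

// Sketch: build one engine targeting the DLA and one targeting the GPU
// with TensorRT 5, where buildEngineWithConfig does not yet exist.
// The network-population step is elided.
ICudaEngine* buildEngine(ILogger& logger, bool useDLA)
{
    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetwork();

    // ... populate `network` with the convolution layer here ...

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 28);

    if (useDLA)
    {
        builder->setFp16Mode(true);  // DLA requires FP16 (or INT8) precision
        builder->setDefaultDeviceType(DeviceType::kDLA);
        builder->setDLACore(0);
        builder->allowGPUFallback(true);
    }

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    network->destroy();
    builder->destroy();
    return engine;
}

// Two independent builders/networks/engines, one per device:
// ICudaEngine* gpuEngine = buildEngine(gLogger, /*useDLA=*/false);
// ICudaEngine* dlaEngine = buildEngine(gLogger, /*useDLA=*/true);
```

Is that the intended pattern, or is there a cheaper way to share the parsed network between the two builds?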
Thanks a lot
Eyal