I defined my own network, then ran the following pipeline:
TensorFlow model → ONNX → ONNX parser (C++ inference)
If I do not call
::enableDLA(builder.get(), config.get(), mParams.dlaCore);
I get almost a 100 ms speedup.
With this call, the DLA core is used, but only some layers run on the DLA while the rest run on the GPU.
I have no idea why there is such a big difference.
My guess is that the memory copies between the DLA and the GPU are the cause.
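For context, a helper like `::enableDLA()` in the TensorRT samples essentially sets a few builder-config options. This is only a sketch of the typical configuration (the exact helper lives in the samples' common code); the function name here is illustrative:

```cpp
#include "NvInfer.h"

// Sketch of what a DLA-enable helper typically does, modeled on the
// TensorRT sample helper ::enableDLA(builder, config, dlaCore).
void enableDLASketch(nvinfer1::IBuilder* builder,
                     nvinfer1::IBuilderConfig* config, int dlaCore)
{
    if (dlaCore >= 0 && builder->getNbDLACores() > 0)
    {
        // Place layers on the DLA by default; layers the DLA does not
        // support fall back to the GPU. This mixed placement is why
        // some layers run on the DLA and some on the GPU.
        config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
        config->setDLACore(dlaCore);
        config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
    }
}
```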
Have you maximized the device performance first?
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
Please note that the DLA has limited resources, so some operations may need to wait for resources to become available.
The memory is shareable between the GPU and the DLA via EGLStreams, so there is no memcpy within TensorRT inference.
I did this as you suggested, but only got a 20 ms speedup. It is still slower than not using the DLA.
This is expected.
Please see this blog for some information about Xavier:
For NX, each DLA is 4.5 TOPS, while the GPU has 12.3 TOPS.
Please note that the DLA is intended to offload tasks from the GPU, leaving the GPU free for other work.
DLA performance also depends on the model size, since the DLA's capacity is much smaller than the GPU's.
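As a rough sanity check on the latency gap, the quoted peak numbers alone suggest a single DLA core is well under half as fast as the GPU (assuming the TOPS figures above are comparable peak values):

```python
# Rough peak-throughput comparison on Jetson Xavier NX,
# using the TOPS figures quoted above.
gpu_tops = 12.3  # GPU peak
dla_tops = 4.5   # one DLA core

ratio = gpu_tops / dla_tops
print(f"GPU peak is ~{ratio:.1f}x one DLA core")  # ~2.7x
```

So even ignoring any synchronization with GPU-fallback layers, a latency-bound model can easily run slower when its layers are placed on the DLA.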