Why does inference run slower when using DLA and GPU together, even though the DLA model was built entirely for DLA?

Here is my model for testing:
dla_w117.onnx (9.0 MB)

Transform to DLA:
./trtexec --onnx=dla_w117.onnx --fp16 --useDLACore=0 --workspace=1024 --saveEngine=ssd.trt

Transform to GPU:
./trtexec --onnx=dla_w117.onnx --fp16 --workspace=1024 --saveEngine=ssg.trt

Here is my code for testing:

testmodel.tar.xz (4.1 MB)

How can I keep the DLA model from affecting the GPU models?

Hi,

What performance do you get with and without the GPU running?
We tested your model: the DLA engine runs in 139.415 ms without the GPU and 162.466 ms with the GPU running.

Although the model can be fully deployed on the DLA, memory and bandwidth are shared with the GPU.
Would you mind checking with our profiler first whether the regression comes from data IO?

Thanks.

How can I check with your profiler whether the regression comes from data IO?

Hi,

You can compare the execution time of the inference blocks against the memory blocks.
Memory-related functions usually have names like cudaMemcpy.
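
For example, since both cudaMemcpy and IExecutionContext::execute are synchronous, you can also time the copies and the inference separately on the host. The sketch below is not your test code; the buffer pointers and byte counts are placeholders for whatever your application already allocates for the engine bindings:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: time the host<->device copies separately from the inference call.
void timedInfer(nvinfer1::IExecutionContext* context, void** bindings,
                void* dIn, const void* hIn, size_t inputBytes,
                void* hOut, const void* dOut, size_t outputBytes)
{
    using Clock = std::chrono::steady_clock;
    auto ms = [](Clock::time_point a, Clock::time_point b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };

    auto t0 = Clock::now();
    cudaMemcpy(dIn, hIn, inputBytes, cudaMemcpyHostToDevice);     // input copy (memory block)
    auto t1 = Clock::now();
    context->execute(1, bindings);                                // inference block
    auto t2 = Clock::now();
    cudaMemcpy(hOut, dOut, outputBytes, cudaMemcpyDeviceToHost);  // output copy (memory block)
    auto t3 = Clock::now();

    std::printf("H2D %.3f ms | execute %.3f ms | D2H %.3f ms\n",
                ms(t0, t1), ms(t1, t2), ms(t2, t3));
}
```

If the copy or execute times grow noticeably when the GPU engine runs alongside the DLA engine, the regression comes from the shared memory bandwidth rather than from the DLA computation itself.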

Thanks.

I didn't use any function like cudaMemcpy; I just use initEngine to load the engine from the file and allocate memory for the engine input/output once.
Does the bandwidth cost occur inside bool IExecutionContext::execute(int batchSize, void** bindings)?
Does the memory copy happen in the execute interface?

Hi,

Some memory transfers or copies are still required when running TensorRT.
Have you profiled the application? You should find some memory-related jobs in the profiler output.
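
If you want per-layer timing from inside your own application rather than from a system-level profiler, TensorRT also lets you attach a profiler to the execution context. A minimal sketch, assuming the standard TensorRT C++ API; context and bindings stand for the objects your test code already creates:

```cpp
#include <NvInfer.h>
#include <cstdio>

// Sketch: print the time TensorRT spends in each layer of the engine,
// including any reformat/copy layers inserted around DLA subgraphs.
struct LayerProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) noexcept override
    {
        std::printf("%-60s %8.3f ms\n", layerName, ms);
    }
};

// Usage with the synchronous execute() call you mention:
//   LayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(1, bindings);   // per-layer times are reported each run
```

Comparing that per-layer report with and without the GPU engine running should show whether the extra time is spent in the data-movement layers or spread across the whole network.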

Thanks.