When I convert an ONNX model to TensorRT with DLA and GPU fallback enabled, some layers are placed on the DLA and others on the GPU.
I would like to know how the TensorRT API distributes the layers internally.
Is there any specific logic? If yes, can you please describe it?
Also, can the developer decide which layers are placed on the GPU and which on the DLA?
When the allowGPUFallback flag is enabled, TensorRT moves the layers that DLA does not support to the GPU. The DLA-supported layers run on DLA. For the list of DLA-supported layers, please check Developer Guide :: NVIDIA Deep Learning TensorRT Documentation
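For illustration, here is a rough sketch of the fallback rule described above, together with one way a developer could pin individual layers to the GPU through the TensorRT Python API. The helper names `plan_devices` and `build_config` and the `force_gpu` parameter are hypothetical, made up for this example. The TensorRT calls (`set_flag`, `default_device_type`, `DLA_core`, `can_run_on_DLA`, `set_device_type`) are builder-config members, but the builder's real partitioning has more constraints than this simplified rule, so treat the plan as an approximation.

```python
def plan_devices(layer_names, dla_capable, force_gpu=()):
    """Return {layer_name: "DLA" or "GPU"} using the simple fallback rule:
    a layer goes to DLA only if DLA supports it and the developer did not
    explicitly pin it to the GPU. (Hypothetical helper, not a TensorRT API.)"""
    plan = {}
    for name in layer_names:
        if name in dla_capable and name not in force_gpu:
            plan[name] = "DLA"
        else:
            plan[name] = "GPU"
    return plan


def build_config(builder, network, force_gpu=(), dla_core=0):
    """Apply the plan to a TensorRT builder config (needs the tensorrt package)."""
    import tensorrt as trt  # imported lazily so plan_devices works without TensorRT

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)            # DLA requires FP16 or INT8 precision
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers may run on the GPU
    config.default_device_type = trt.DeviceType.DLA  # prefer DLA for every layer
    config.DLA_core = dla_core

    layers = [network.get_layer(i) for i in range(network.num_layers)]
    capable = {layer.name for layer in layers if config.can_run_on_DLA(layer)}
    plan = plan_devices([layer.name for layer in layers], capable, force_gpu)
    for layer in layers:
        if plan[layer.name] == "GPU":
            # Explicit per-layer override; this is how a developer picks the device.
            config.set_device_type(layer, trt.DeviceType.GPU)
    return config, plan
```

For example, `plan_devices(["conv1", "softmax", "conv2"], {"conv1", "conv2"}, force_gpu={"conv2"})` keeps only `conv1` on DLA. The layer names here are made up.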
When I run inference on a model with DLA enabled, I observe that most of the layers fall back to the GPU. I am getting the following results:
As we can observe from the table, the throughput is very high with GPU only, but much lower with DLA (with GPU fallback enabled).
Why is the throughput so low with DLA when most of the layers fall back to the GPU? Is there any specific reason?
Better performance on the GPU than on DLA is expected, because the GPU is the more powerful engine.
When using DLA + GPU, there is some inherent transfer of intermediate layer outputs between the GPU and DLA, which increases the overall execution time.
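To make the transfer overhead concrete, here is a toy cost model. It is only a sketch: the function name `estimate_latency_ms`, the fixed per-transition cost, and all the numbers are assumptions for illustration, not measurements from any device.

```python
def estimate_latency_ms(segments, transfer_ms):
    """Estimate end-to-end latency for a network split into device segments.

    segments: list of (device, compute_ms) tuples in execution order.
    transfer_ms: assumed fixed cost each time execution switches device.
    Returns (total_ms, number_of_device_transitions).
    """
    compute = sum(ms for _, ms in segments)
    # Count every boundary where consecutive segments run on different devices.
    transitions = sum(
        1 for prev, cur in zip(segments, segments[1:]) if prev[0] != cur[0]
    )
    return compute + transitions * transfer_ms, transitions


# Hypothetical numbers: a network that bounces between DLA and GPU.
mixed = [("DLA", 2.0), ("GPU", 1.0), ("DLA", 2.0), ("GPU", 1.0)]
total_ms, hops = estimate_latency_ms(mixed, transfer_ms=0.5)
print(total_ms, hops)  # 7.5 3
```

In this model every extra DLA/GPU boundary adds a fixed cost on top of the compute time, so a network that is fragmented into many small segments can end up slower than one that runs entirely on a single device.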
Even allowing for the intermediate data transfer on DLA, why is there such a big difference between the throughput results for DLA and for GPU + 2 DLAs (multithreaded)?
Please refer to the DLA FAQ page to see if that addresses some of your questions: Deep-Learning-Accelerator-SW/FAQ