I’ve been trying to offload some GEMM computations to DLA.
However, I find that the DLA is almost 100x slower than the GPU for this operation.
Is this expected behaviour or is something going wrong here?
Testing methodology
A 1024x1024 PyTorch nn.Linear layer was exported to ONNX and converted to a TensorRT engine via trtexec (a sketch of the setup is shown below).
The ONNX file and the trtexec data are also attached. Note: the ONNX file has a .txt extension because the forum wouldn’t let me upload a .onnx file.
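For reference, a minimal sketch of this setup, assuming shapes from the description above (the file name, opset, and exact trtexec flags are illustrative, not the attached files):

```python
# Sketch of the test: export a 1024x1024 linear layer to ONNX,
# then profile it with trtexec on DLA vs. GPU.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).eval()
dummy_input = torch.randn(1024, 1024)  # batch of 1024 as in the post

torch.onnx.export(
    model,
    dummy_input,
    "linear_1024.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)

# Then build and profile with trtexec, e.g.:
#   trtexec --onnx=linear_1024.onnx --fp16 --useDLACore=0 --allowGPUFallback
#   trtexec --onnx=linear_1024.onnx --fp16
```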
Here are some possible reasons why the DLA throughput is much lower than the GPU’s:
1. The 2D GEMM (NxC, CxK) here gets translated to a 4D 1x1 convolution (NxCx1x1 activations, KxCx1x1 weights); see the equivalence sketch after this list.
→ The single Conv node ends up surrounded by Reshape/Transpose nodes, which add overhead.
2. DLA has lower available DRAM bandwidth than the GPU.
3. For multi-batched convolution, the performance sweet spot is a batch size below 16, not the bs=1024 used here.
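To make point 1 concrete, here is a small sketch showing that an nn.Linear layer computes the same thing as a 1x1 convolution on NxCx1x1 tensors; the extra view/reshape calls are what surface as the Reshape/Transpose nodes around the Conv (all names and shapes are illustrative):

```python
import torch
import torch.nn as nn

N, C, K = 1024, 1024, 1024
linear = nn.Linear(C, K, bias=False).eval()

# Equivalent 1x1 convolution: the same weight matrix, viewed as KxCx1x1
conv = nn.Conv2d(C, K, kernel_size=1, bias=False).eval()
conv.weight.data = linear.weight.data.view(K, C, 1, 1)

x = torch.randn(N, C)
y_linear = linear(x)                          # 2D GEMM: (N, C) x (C, K)
y_conv = conv(x.view(N, C, 1, 1)).view(N, K)  # same math via 1x1 conv

print(torch.allclose(y_linear, y_conv, atol=1e-4))  # True
```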
A much better performance benchmark for the DLA would be running, for example, 3x3 convolutions with input and output channel counts that are multiples of 64 and a sufficiently large input resolution, as sketched below.
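A minimal sketch of such a DLA-friendly benchmark model (the exact channel counts, resolution, and file names are assumptions, not a prescribed configuration):

```python
import torch
import torch.nn as nn

# DLA-friendly benchmark: 3x3 convs, channel counts that are multiples
# of 64, a large spatial resolution, and a small batch size.
model = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(),
).eval()

dummy_input = torch.randn(1, 64, 224, 224)  # batch size 1, not 1024

torch.onnx.export(model, dummy_input, "conv_bench.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)

# Profile on DLA, e.g.:
#   trtexec --onnx=conv_bench.onnx --fp16 --useDLACore=0
```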
For the FP16 vs. INT8 performance difference, please check the description below for more info as well: