Is there any reference for performance tuning of convolutions on the DLA? I'm imagining something like the existing tuning guide for GPUs.
One specific consideration is building a TensorRT engine with tlt-converter, where you can specify the input dimension ordering. Cameras usually output (N)HWC, which is also the optimal layout for Tensor Cores. However, the DLA convolution engine might have different characteristics, and transposing the data also has a cost.
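To make the transpose cost concrete, here is a quick host-side sketch with NumPy (the frame size is just an illustrative assumption; the actual cost on a Jetson would differ, and an engine doing the transpose on-device would behave differently again):

```python
import time

import numpy as np

# Simulated camera frame in NHWC layout (batch, height, width, channels).
frame_nhwc = np.random.rand(1, 1080, 1920, 3).astype(np.float32)

# Transpose to NCHW, the layout many TensorRT engines expect by default.
# ascontiguousarray forces the actual memory copy, so we time real work,
# not just a lazy stride change.
start = time.perf_counter()
frame_nchw = np.ascontiguousarray(frame_nhwc.transpose(0, 3, 1, 2))
elapsed_ms = (time.perf_counter() - start) * 1e3

print(frame_nchw.shape)
print(f"transpose took {elapsed_ms:.2f} ms")
```

Even a few milliseconds per frame matters at camera frame rates, which is why I'd like to know whether the DLA actually prefers CHW before paying that cost.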
I could benchmark this myself, but I figured NVIDIA has probably done so already :)