DLA convolution performance considerations

Is there a reference for performance tuning of convolutions on DLA? I imagine something like the guidance that exists for GPUs.

One specific consideration is building a TensorRT engine with tlt-converter, where you can specify the input dimension ordering. Cameras usually output (N)HWC, which is also the optimal layout for Tensor Cores. However, the DLA convolution engine might have different characteristics, and transposing the input also incurs a cost.

I could benchmark this myself, but I thought NVIDIA has probably done so already :)

I just found out that the TensorFormat CHW32 is advertised as the DLA's “native format for INT8”. Is it correct to assume that using that format for convolutions gives the best performance?
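For reference, here is how I understand the CHW32 layout from the TensorRT documentation: channels are grouped into vectors of 32, so an {N, C, H, W} tensor is stored like a C array of shape [N][ceil(C/32)][H][W][32]. The small Python sketch below only illustrates that index mapping. The function names are my own and not part of any NVIDIA API.

```python
def chw32_offset(n, c, h, w, C, H, W):
    """Element offset of tensor coordinate (n, c, h, w) in a CHW32 buffer.

    Assumes the layout [N][ceil(C/32)][H][W][32] described in the TensorRT docs.
    """
    blocks = (C + 31) // 32  # number of 32-channel groups, rounded up
    return (((n * blocks + c // 32) * H + h) * W + w) * 32 + c % 32


def chw32_size(N, C, H, W):
    """Number of elements the buffer needs, including channel padding."""
    return N * ((C + 31) // 32) * H * W * 32


if __name__ == "__main__":
    # A 3-channel image is padded up to 32 channels per pixel in this layout.
    print(chw32_size(1, 3, 224, 224))
```

One thing this makes visible: with a 3-channel camera image, each pixel still occupies a full 32-element vector, so converting into this layout is not free for the first layer.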

From what I saw, yes.
You can try it out with NVIDIA's trtexec utility: build the network for DLA, then run and benchmark it with the input format you want to compare.
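A rough sketch of that experiment, in case it helps. It builds one trtexec command line per I/O format and runs each if trtexec is on the PATH. The model path `model.onnx`, the ONNX input (rather than a TLT export), the single-input network, and the list of candidate formats are all assumptions on my part, so check `trtexec --help` for the flags and formats your TensorRT version accepts. INT8 without calibration data is only meant for timing here, not for accuracy.

```python
import shutil
import subprocess


def build_trtexec_cmd(onnx_path, io_format, dla_core=0):
    """Return a trtexec argument list for one DLA INT8 benchmark run."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--useDLACore={dla_core}",       # run supported layers on this DLA core
        "--allowGPUFallback",             # unsupported layers fall back to the GPU
        "--int8",
        f"--inputIOFormats={io_format}",  # e.g. "int8:chw32"
        f"--outputIOFormats={io_format}",
    ]


# Hypothetical candidates to compare; adjust to what your version supports.
CANDIDATE_FORMATS = ["int8:chw32", "int8:dla_linear"]

if __name__ == "__main__":
    for fmt in CANDIDATE_FORMATS:
        cmd = build_trtexec_cmd("model.onnx", fmt)
        print(" ".join(cmd))
        if shutil.which("trtexec"):
            # trtexec prints latency and throughput statistics itself.
            subprocess.run(cmd, check=False)
```

Comparing the latency numbers trtexec reports for each format should show whether CHW32 actually wins for your network, including any reformatting layers that get inserted.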