Is there any reference for performance tuning of convolutions on the DLA? I'm imagining something like the existing tuning guide for GPUs.
One specific consideration is building a TensorRT engine with tlt-converter, where you can specify the input dimension ordering. Cameras usually output (N)HWC, which is also the optimal layout for Tensor Cores. However, the DLA convolution engine might have different characteristics, and transposing the data also has a cost.
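To make the transpose cost concrete, here is a quick host-side sketch with NumPy (the frame size is just an illustrative assumption; the actual cost on a Jetson would differ, and an engine doing the transpose on-device would behave differently again):

```python
import time

import numpy as np

# Simulated camera frame in NHWC layout (batch, height, width, channels).
frame_nhwc = np.random.rand(1, 1080, 1920, 3).astype(np.float32)

# Transpose to NCHW, the layout many TensorRT engines expect by default.
# ascontiguousarray forces the actual memory copy, so we time real work,
# not just a lazy stride change.
start = time.perf_counter()
frame_nchw = np.ascontiguousarray(frame_nhwc.transpose(0, 3, 1, 2))
elapsed_ms = (time.perf_counter() - start) * 1e3

print(frame_nchw.shape)
print(f"transpose took {elapsed_ms:.2f} ms")
```

Even a few milliseconds per frame matters at camera frame rates, which is why I'd like to know whether the DLA actually prefers CHW before paying that cost.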
I could benchmark this myself, but I figured NVIDIA has probably done so already :)