Slow convolution speeds on TK1

Jetson TK1 (L4T, CUDA 6.5, cuDNN 2.0)

Convolutions are slow on this hardware. My test case: a 256x256x3 input image (format NHWC) convolved with a 3x3 filter producing 64 output channels (format KCHW), giving a 256x256x64 output tensor (format NCHW), run via cudnnConvolutionForward(). I then add biases to the convolution output with cudnnAddTensor() and apply ReLU to the final output with cudnnActivationForward().

This convolution is taking 10.73ms.
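To put that 10.73ms in perspective, here is a back-of-envelope arithmetic check (a sketch; the ~326 GFLOP/s FP32 peak figure for the TK1's 192-core Kepler GPU is my assumption, and I assume 3 input channels as described above):

```python
# Back-of-envelope: achieved FLOP rate of the convolution described above.
# Assumes a 3x3 filter over 3 input channels, 64 output channels,
# a 256x256 output, and the measured 10.73 ms runtime.
H, W = 256, 256          # output spatial size
C_in, C_out = 3, 64      # input / output channels
K = 3                    # filter is K x K

macs = H * W * C_out * K * K * C_in   # multiply-accumulates per forward pass
flops = 2 * macs                      # 1 MAC = 2 FLOPs
elapsed_s = 10.73e-3

achieved_gflops = flops / elapsed_s / 1e9
print(f"total FLOPs: {flops}")                      # 226,492,416
print(f"achieved: {achieved_gflops:.1f} GFLOP/s")

# If the TK1's GPU peaks at roughly 326 GFLOP/s in FP32 (assumed),
# this run reaches only a few percent of peak -- consistent with the
# operation being memory-bound or the cuDNN 2.0 kernel being a poor fit.
```

So the operation itself is tiny (~0.23 GFLOP); the question is why so little of the hardware's throughput is being used.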

Is this the expected performance for such a small convolution, or are there optimisation tricks I am missing?

I’m assuming that the GPU having to go through external memory (shared with the CPU, rather than dedicated GPU memory) is a serious bottleneck in these kinds of operations.

The question is: how significant a performance boost would I get executing this same operation on a Tegra X1 or X2 (on Jetson boards)? I’m assuming that FP16 support on the TX1/TX2 and onboard GPU memory would significantly improve performance on these operations.
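One way to frame the FP16 part of that question is memory traffic: halving the element size halves the bytes that must cross the memory bus. A rough sketch (the ~14.9 GB/s TK1 bandwidth figure is my assumption, and real cuDNN kernels move more data than this idealized minimum):

```python
# Minimum memory traffic for the convolution above: read input + filters,
# write output once, ignoring workspace and intermediate traffic.
def min_traffic_bytes(elem_size):
    inp  = 256 * 256 * 3 * elem_size       # input image (NHWC)
    filt = 64 * 3 * 3 * 3 * elem_size      # filters (KCHW)
    out  = 256 * 256 * 64 * elem_size      # output tensor (NCHW)
    return inp + filt + out

fp32 = min_traffic_bytes(4)
fp16 = min_traffic_bytes(2)
print(f"FP32 traffic: {fp32 / 1e6:.2f} MB")   # ~17.57 MB
print(f"FP16 traffic: {fp16 / 1e6:.2f} MB")   # exactly half

# At ~14.9 GB/s (TK1 DDR3L, assumed), the FP32 lower bound is ~1.2 ms,
# so the measured 10.73 ms is far above even a bandwidth-limited floor.
print(f"TK1 FP32 lower bound: {fp32 / 14.9e9 * 1e3:.2f} ms")
```

This suggests FP16 alone would at best halve the bandwidth-bound portion; the bigger gains on TX1/TX2 would likely come from newer cuDNN kernels and higher overall throughput.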

Hi psdeering,

I can’t tell you the exact performance improvement you would get, as I haven’t tried it on all platforms, but as we suggested in the other topic you posted, TX1 or TX2 is recommended for deep learning use cases.

NHWC is supported in our newer cuDNN versions but is not available for TK1.
http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#four-D-tensor-descriptor

Thanks