Jetson TK1 (L4T, CUDA 6.5, cuDNN 2.0)
Convolutions are slow on this hardware. The timing below is for a 256x256x3 input image (NHWC format) convolved with a 3x3x64 filter (KCHW format), producing a 256x256x64 output tensor (NCHW format) via cudnnConvolutionForward(). I then add biases to the convolution output with cudnnAddTensor() and apply ReLU to the final output with cudnnActivationForward().
This convolution is taking 10.73ms.
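
For context, the per-layer sequence being timed looks roughly like the sketch below. This is a simplified reconstruction rather than my exact code: descriptor setup, memory allocation, and error checking are omitted, all names are placeholders, and the signatures follow the cuDNN 2.0-era API as far as I recall it.

    #include <cudnn.h>

    /* Simplified sketch of the per-layer sequence being timed (cuDNN v2-era
     * API; descriptor setup, allocation, and error checking are omitted, and
     * all names here are placeholders rather than my actual code). */
    void conv_bias_relu(cudnnHandle_t handle,
                        cudnnTensorDescriptor_t srcDesc, const void *srcData,
                        cudnnFilterDescriptor_t filterDesc, const void *filterData,
                        cudnnConvolutionDescriptor_t convDesc,
                        cudnnConvolutionFwdAlgo_t algo,
                        void *workspace, size_t workspaceSize,
                        cudnnTensorDescriptor_t biasDesc, const void *biasData,
                        cudnnTensorDescriptor_t dstDesc, void *dstData)
    {
        const float one = 1.0f, zero = 0.0f;

        /* Convolution: 256x256x3 input, 3x3x64 filters -> 256x256x64 output. */
        cudnnConvolutionForward(handle, &one,
                                srcDesc, srcData,
                                filterDesc, filterData,
                                convDesc, algo,
                                workspace, workspaceSize,
                                &zero, dstDesc, dstData);

        /* Add one bias per output channel (the add-mode argument is specific
         * to the cuDNN v1/v2 signature, if I remember it correctly). */
        cudnnAddTensor(handle, CUDNN_ADD_SAME_C, &one,
                       biasDesc, biasData,
                       &one, dstDesc, dstData);

        /* ReLU applied in place on the final output. */
        cudnnActivationForward(handle, CUDNN_ACTIVATION_RELU, &one,
                               dstDesc, dstData,
                               &zero, dstDesc, dstData);
    }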
Is this the expected performance for such a small convolution, or are there optimisation tricks I am missing?