This might sound like an apples vs oranges comparison at first, but it isn’t.
On various devices, I noticed that 2-D convolution from cuDNN is slower than SGEMM from cuBLAS. For example, on my GTX 980 I get up to 4 TFLOPS from SGEMM but never more than 2 TFLOPS from conv2d (assuming the data is already on the device).
This is especially puzzling because, for some input geometries, conv2d is exactly equivalent to SGEMM, and one can simply call SGEMM instead of conv2d. (This happens when the filter's height and width equal the input's height and width: each output is then a single 1×1 value, the dot product of the flattened input and the flattened filter, so the whole batch reduces to one matrix multiply.) And yet I see a 2-fold performance difference even in these cases.
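To make the claimed equivalence concrete, here is a small NumPy sketch (shapes, sizes, and the single-channel simplification are my own illustrative choices, not taken from any particular benchmark): a "valid" cross-correlation whose filter spans the entire input produces one scalar per (image, filter) pair, which is exactly one entry of a GEMM.

```python
import numpy as np

# Batch of N single-channel images and K filters, with each filter
# the same size (H x W) as the image. A "valid" cross-correlation
# then has nowhere to slide: each (image, filter) pair yields a
# single 1x1 output, the dot product of the two flattened arrays.
rng = np.random.default_rng(0)
N, K, H, W = 8, 16, 5, 7

images = rng.standard_normal((N, H, W)).astype(np.float32)
filters = rng.standard_normal((K, H, W)).astype(np.float32)

# Direct cross-correlation: elementwise multiply, then sum.
conv_out = np.empty((N, K), dtype=np.float32)
for n in range(N):
    for k in range(K):
        conv_out[n, k] = np.sum(images[n] * filters[k])

# The same computation as a single GEMM: (N, H*W) @ (H*W, K).
gemm_out = images.reshape(N, H * W) @ filters.reshape(K, H * W).T

assert np.allclose(conv_out, gemm_out, atol=1e-4)
```

In this regime the two routines compute term-for-term the same sums of products, which is why one would expect them to hit similar FLOP rates.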
I had thought that both SGEMM and conv2d were arithmetic-bound, and that the cuDNN implementation was well optimized.
(Technically, I'm talking about, and timing, cross-correlation rather than convolution.)