How are the 2D and 3D convolutions implemented in cuDNN? If I understand the documentation correctly, a dense matrix-vector multiplication is used, which seems very inefficient. Why not use sparse matrix-vector multiplications instead, since 3 x 3 filters will generate very sparse convolution matrices? Also, is there a comparison between different ways of performing the convolutions, for example against an implementation that does not go through a matrix multiplication but instead has each GPU thread compute the convolution for one pixel?

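In the docs there are several values of the enum cudnnConvolutionFwdAlgo_t. Those are the different algorithms that can be used for the forward pass alone; the backward passes have their own enums, such as cudnnConvolutionBwdDataAlgo_t. If you call cudnnFindConvolutionForwardAlgorithm(), it benchmarks the forward algorithms and fills an array of cudnnConvolutionFwdAlgoPerf_t results, sorted in ascending order of execution time. Several algorithms are offered because they differ in how much workspace memory they need, so the fastest one is not always usable within your memory budget.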
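
To make the workspace point concrete, here is a minimal sketch of how you might pick an algorithm from the benchmark results. It does not call cuDNN: `AlgoPerf` and `pickFastestFitting` are hypothetical names made up for illustration, and the struct only mirrors the `algo`, `status`, `time` and `memory` fields of cudnnConvolutionFwdAlgoPerf_t. It assumes the entries are already sorted by ascending time, as the find function returns them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one benchmark entry. It mirrors only the
// fields of cudnnConvolutionFwdAlgoPerf_t that matter for this sketch.
struct AlgoPerf {
    int algo;            // which forward algorithm was benchmarked
    bool succeeded;      // stands in for status == CUDNN_STATUS_SUCCESS
    float time_ms;       // measured execution time in milliseconds
    std::size_t memory;  // workspace bytes this algorithm requires
};

// Returns the index of the first entry that ran successfully and fits in
// the workspace budget, or -1 if none does. Because the entries are
// assumed to be sorted by ascending time, the first match is the fastest.
int pickFastestFitting(const std::vector<AlgoPerf>& results,
                       std::size_t workspace_limit) {
    for (std::size_t i = 0; i < results.size(); ++i) {
        if (results[i].succeeded && results[i].memory <= workspace_limit) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

In real code you would fill the array with cudnnFindConvolutionForwardAlgorithm() and read the same fields from each cudnnConvolutionFwdAlgoPerf_t entry. The point is only that the first entry is the fastest, not necessarily the one you can afford.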