I am running a standard discrete separable 3D Gaussian convolution on a 768x768x32 input set, but the convolution along the Z dimension takes about 4x as long as the pass along the X or Y dimension.

While this is somewhat to be expected, I wonder what coalesced memory access approach could be applied here?

I have already gone through the related code in the CUDA SDK, looked around a bit via Google, and checked out the first results, such as these:

https://code.msdn.microsoft.com/windowsdesktop/Gaussian-blur-with-CUDA-5-df5db506

http://www-igm.univ-mlv.fr/~biri/Enseignement/MII2/Donnees/convolutionSeparable.pdf

Most published work focuses on the 2D case, but it turns out that the third dimension (assuming a contiguous column-major layout) is the bottleneck.

With my current `__shared__` memory approach the running time is decent, but I suspect there is significant room for improvement in the Z-dimension kernel.
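To make the access pattern in question concrete, here is a minimal sketch of a Z pass. It is not the kernel described above. It assumes that x is the fastest-varying index (`idx = x + nx * (y + ny * z)`), that out-of-range slices are clamped to the border, and that the filter has at most 33 taps. The names `convolveZ`, `runConvolveZ`, `c_weights` and `MAX_RADIUS` are hypothetical. Each thread owns one (x, y) column and marches along z, so the threads of a warp read consecutive addresses within every slice.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define MAX_RADIUS 16                               // assumed upper bound on the filter radius
__constant__ float c_weights[2 * MAX_RADIUS + 1];   // filter taps (hypothetical name)

// Z pass: one thread per (x, y) column, marching along z.
// threadIdx.x follows x, assumed to be the fastest-varying index, so a warp
// reads 32 consecutive floats in each z slice (coalesced loads and stores).
__global__ void convolveZ(const float* in, float* out,
                          int nx, int ny, int nz, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    size_t slice = (size_t)nx * ny;                 // elements per z slice
    size_t base  = (size_t)y * nx + x;              // offset of this column in slice 0

    for (int z = 0; z < nz; ++z) {
        float sum = 0.0f;
        for (int k = -radius; k <= radius; ++k) {
            int zz = min(max(z + k, 0), nz - 1);    // clamp at the volume borders
            sum += c_weights[k + radius] * in[base + zz * slice];
        }
        out[base + z * slice] = sum;
    }
}

// Host wrapper: copies the volume to the device, runs the Z pass, copies back.
// Error checking is omitted for brevity. w.size() must be odd and at most
// 2 * MAX_RADIUS + 1.
std::vector<float> runConvolveZ(const std::vector<float>& in,
                                int nx, int ny, int nz,
                                const std::vector<float>& w)
{
    int radius = (int)(w.size() / 2);
    size_t bytes = in.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_weights, w.data(), w.size() * sizeof(float));

    dim3 block(32, 8);                              // 32 threads along x = one warp per row
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    convolveZ<<<grid, block>>>(dIn, dOut, nx, ny, nz, radius);

    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

In this sketch every input value is re-read from global memory by the same thread for each tap, so reuse depends entirely on the cache. Keeping a sliding window of the last `2 * radius + 1` values in registers or shared memory is one possible refinement. Whether either variant beats an existing shared-memory kernel would have to be measured.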

Any ideas, links, papers, or code on that specific topic?