I am running a standard discrete separable 3D Gaussian convolution on a 768x768x32 input volume, but the pass along the Z dimension takes about 4x as long as the pass along the X or Y dimension.
While this is somewhat to be expected, I wonder which coalesced memory access approach could be applied here?
I have already gone through the related code in the CUDA SDK and looked around a bit via Google, checking out the first results like this:
Most published work focuses on the 2D case, whereas it turns out that the third dimension (assuming contiguous column-major format) is the bottleneck.
With my current shared memory approach the running time is decent, but I suspect there is significant room for improvement in the kernel for the third (Z) dimension.
Any ideas/links/papers/code on that specific topic?
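To make the layout question concrete, here is a minimal sketch of one possible Z pass. It is not the kernel described above, and everything in it is an assumption made for illustration: the names `convolveZ`, `d_weights` and `RADIUS`, the clamp-to-edge border handling, the `float` element type, and a layout where X is the fastest-varying index, so element (x, y, z) lives at `x + y*W + z*W*H`. For a 768x768x32 volume, one step along Z is then a stride of 768*768 = 589,824 elements. The idea shown is that adjacent threads are mapped to adjacent x, so each warp-level load and store touches consecutive addresses even though every thread walks along Z.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define RADIUS 4  // hypothetical filter radius, chosen only for this sketch

// Filter taps in constant memory; every thread reads the same tap at a time.
__constant__ float d_weights[2 * RADIUS + 1];

// Z pass: thread index maps to x (the contiguous dimension), so the threads
// of a warp read and write consecutive addresses for every tap, while each
// individual thread steps through memory with a stride of W*H.
__global__ void convolveZ(const float* in, float* out, int W, int H, int D)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (x >= W || y >= H) return;

    size_t slice = (size_t)W * H;     // distance between z and z+1
    size_t base  = (size_t)y * W + x; // offset of (x, y, 0)

    for (int z = 0; z < D; ++z) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            int zz = min(max(z + k, 0), D - 1);  // clamp to the volume edge
            acc += d_weights[k + RADIUS] * in[base + (size_t)zz * slice];
        }
        out[base + (size_t)z * slice] = acc;
    }
}

int main()
{
    // Small test volume; error checking of CUDA calls is omitted for brevity.
    const int W = 64, H = 8, D = 16;
    const size_t n = (size_t)W * H * D;

    // Normalized Gaussian taps (sigma is an arbitrary choice for the sketch).
    float h_w[2 * RADIUS + 1];
    const float sigma = 1.5f;
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        h_w[k + RADIUS] = expf(-0.5f * k * k / (sigma * sigma));
        sum += h_w[k + RADIUS];
    }
    for (int k = 0; k < 2 * RADIUS + 1; ++k) h_w[k] /= sum;
    cudaMemcpyToSymbol(d_weights, h_w, sizeof(h_w));

    // Constant input: with normalized taps and clamped borders the output
    // should reproduce the input up to rounding.
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(64);
    dim3 grid((W + block.x - 1) / block.x, H);
    convolveZ<<<grid, block>>>(d_in, d_out, W, H, D);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    float maxErr = 0.0f;
    for (size_t i = 0; i < n; ++i)
        maxErr = fmaxf(maxErr, fabsf(h_out[i] - 1.0f));
    printf("max abs error vs. constant input: %g\n", maxErr);

    cudaFree(d_in);
    cudaFree(d_out);
    return maxErr < 1e-4f ? 0 : 1;
}
```

This should build with something like `nvcc zpass.cu -o zpass`. The sketch rereads all `2*RADIUS + 1` taps from global memory for every output element; keeping a sliding window of Z samples in registers, or staging a tile in shared memory, would cut that redundancy but is left out here to keep the access pattern easy to see.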