Memory-optimal approach to the Z dimension in separable 3D convolution.

I am running a standard discrete separable 3D Gaussian convolution on a 768x768x32 input set, but the convolution down the Z dimension takes about 4x the time of the X or Y dimension.
While this is somewhat to be expected, I wonder what coalesced-memory approach might be applied here?

I have already gone through the related code in the CUDA SDK and looked around a bit via Google, checking out the first results like this:

Most published work focuses on the 2D case, while it turns out that the third dimension (assuming contiguous column-major format) is the bottleneck.

With my current shared-memory approach the running time is decent, but I suspect there is significant room for improvement in the Z-dimension kernel.

Any ideas/links/papers/code to that specific topic?

AFAIK, CUDA 3D textures already use a space-filling-curve layout under the hood, so spatially close points in 3D will more than likely be mapped to array indices that are close, but I could be wrong.

If accuracy isn’t a concern and/or your dataset has smooth structures, you can look into approximate bilateral filtering (which, when operating on 2D grayscale images, IS 3D filtering) or other high-dimensional filtering approaches. One such example is


Thanks, but in my situation I need 32-bit accuracy and have to implement this filter so that it returns the exact same result as MATLAB’s convn().

What is interesting is that the number and placement of __syncthreads() around the global memory reads and writes has a massive impact on running time. This seems to be because the number and placement of __syncthreads() affects the number of registers used, which in turn affects occupancy.

Even though I do not need the __syncthreads() at those locations (in the typical sense), adding them improves register utilization and performance by as much as 50%.


What if you do the convolution on the X and Y dimensions, transpose the data, then do the convolution on the Z dimension?

I suspect the __syncthreads() is stopping the compiler from “caching” global memory in registers. Do you see the same if you declare the global memory variable volatile?

That is a great idea which I had not yet thought of, and it will probably work.