3D texture-based separable convolution extension of SDK example

Hi, I haven’t seen this posted, so I thought I would share it. It’s a simple extension of the 2D texture-based separable convolution example found in the SDK, and it could be a good exercise for someone who is new to CUDA. I chose textures for ease of programming and readability; I am aware that using shared memory could be faster.
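In case it helps anyone reading along, here is a minimal sketch of what a single (row) pass looks like with the texture reference API, in the style of the SDK’s 2D convolutionTexture sample. The names (texSrc, d_Kernel, KERNEL_RADIUS) are illustrative and not taken from the attached code:

```cpp
#define KERNEL_RADIUS 8
#define KERNEL_LENGTH (2 * KERNEL_RADIUS + 1)

// Filter coefficients in constant memory, 3D source data bound to a texture.
__constant__ float d_Kernel[KERNEL_LENGTH];
texture<float, 3, cudaReadModeElementType> texSrc;

__global__ void convolutionRowsKernel(float *d_Dst, int imageW, int imageH, int imageD)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= imageW || y >= imageH)
        return;

    // 2D grid over (x, y); walk the z dimension inside the kernel,
    // since compute 1.x hardware (e.g. GTX 260) has no 3D grids.
    for (int z = 0; z < imageD; z++) {
        float sum = 0.0f;
        // Texture addressing handles the borders (clamp mode set at bind time).
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
            sum += tex3D(texSrc, (float)(x + k) + 0.5f, (float)y + 0.5f, (float)z + 0.5f)
                 * d_Kernel[KERNEL_RADIUS - k];
        d_Dst[(z * imageH + y) * imageW + x] = sum;
    }
}
```

The column and depth passes are the same idea with the filter stepping along y or z instead of x.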

Currently it reads the convolution kernels from a Matlab .mat file, so you may need to change where it looks for the Matlab header/library files on your machine. Alternatively, it is easy to comment that part out and use a random kernel as in the SDK.
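For reference, loading a 1D kernel with the MATLAB MAT-file C API looks roughly like this (a sketch only; the variable name and error handling are illustrative, not lifted from the attached code):

```cpp
#include <mat.h>

// Read a 1D filter of length kernelLength from a .mat file into h_Kernel.
// Returns 0 on success, -1 on failure.
int loadKernelFromMat(const char *path, const char *varName,
                      float *h_Kernel, int kernelLength)
{
    MATFile *pmat = matOpen(path, "r");
    if (!pmat)
        return -1;

    mxArray *arr = matGetVariable(pmat, varName);
    if (!arr || (int)mxGetNumberOfElements(arr) != kernelLength) {
        matClose(pmat);
        return -1;
    }

    // .mat stores doubles; convert to float for the device-side constant array.
    double *data = mxGetPr(arr);
    for (int i = 0; i < kernelLength; i++)
        h_Kernel[i] = (float)data[i];

    mxDestroyArray(arr);
    matClose(pmat);
    return 0;
}
```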

I have not tried to optimize the block size, thread count, or occupancy; this is mainly a proof of principle. I see speedups of ~180x over comparable single-threaded CPU code on my GTX 260 with an image size of 2048x512x64 (67,108,864 voxels).
seperableConvolutionTexture3d.zip (227 KB)

I tried your code and it works, but for me about 50% of the time is spent copying the filter response back to the texture between the convolutions, since it is not possible to write to texture memory. I implemented a separable 3D convolution with shared memory and it is about twice as fast. With Fermi it will be even faster, since more blocks can run at the same time (48 KB of shared memory instead of 16 = 3 times faster?), while I think the performance of the texture-based version will stay about the same, since it is memory bound and global memory bandwidth will not increase that much with Fermi.
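For reference, the copy between passes I am talking about is essentially a device-to-device cudaMemcpy3D from the linear output buffer back into the cudaArray backing the 3D texture; a sketch (variable names are illustrative):

```cpp
// Copy the result of one pass back into the cudaArray so the next pass can
// read it through the texture. This is the overhead measured above.
void copyResultBackToArray(float *d_Output, cudaArray *a_Src,
                           int imageW, int imageH, int imageD)
{
    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr(d_Output,
                                     imageW * sizeof(float), imageW, imageH);
    p.dstArray = a_Src;
    p.extent   = make_cudaExtent(imageW, imageH, imageD);
    p.kind     = cudaMemcpyDeviceToDevice;
    cudaMemcpy3D(&p);
}
```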