I implemented a Gaussian filter in CUDA (3x3 kernel). I have written the following versions:
- Non-separable mode using shared memory
- Separable mode using shared memory
I did not try a version using only global memory, as I'm sure it would be very slow.
Option 2 runs much faster.
I read that texture memory can be faster, but I have also read the opposite. My question is:
Could a texture-memory version run faster than option 2 for my project?
I’m using a GeForce GTX970.
I hope you can help me …
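For context, here is roughly what one pass of the separable version (option 2) looks like. This is a minimal sketch of the general technique, not the asker's actual code: the kernel name, the tile size, the clamped border handling, and the normalized 1-2-1 weights are all my own assumptions. A second, analogous kernel would then blur vertically.

```cuda
#define TILE 16

// Horizontal pass of a separable 3x3 Gaussian (weights 1/4, 1/2, 1/4).
// Each block stages a TILE x TILE patch plus a one-pixel horizontal
// halo in shared memory, then each thread computes one output pixel.
__global__ void gaussRow(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TILE][TILE + 2];   // +2 columns for the halo

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool inside = (x < w && y < h);

    if (inside) {
        int xl = max(x - 1, 0);              // clamp at the left edge
        int xr = min(x + 1, w - 1);          // clamp at the right edge
        tile[threadIdx.y][threadIdx.x + 1] = in[y * w + x];
        if (threadIdx.x == 0)
            tile[threadIdx.y][0] = in[y * w + xl];
        if (threadIdx.x == TILE - 1 || x == w - 1)
            tile[threadIdx.y][threadIdx.x + 2] = in[y * w + xr];
    }
    __syncthreads();   // all threads reach the barrier, even out-of-range ones

    if (inside)
        out[y * w + x] = 0.25f * tile[threadIdx.y][threadIdx.x]
                       + 0.5f  * tile[threadIdx.y][threadIdx.x + 1]
                       + 0.25f * tile[threadIdx.y][threadIdx.x + 2];
}
```

Note that out-of-range threads are masked with a flag rather than returning early, so every thread in the block still reaches `__syncthreads()`.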
I'm not sure about the Maxwell architecture, but for the Kepler architecture see
and the paper
‘Communication-Minimizing 2D Convolution in GPU Registers’ (http://parlab.eecs.berkeley.edu/publication/899).
For compute capability >= 3.0, we tend to prefer texture memory (using strategies from the paper mentioned above) for our image-processing routines operating on 2-D pitch-linear memory: it is faster to write, avoids explicit handling of image borders, and yields more compact source code (especially when implementing a templatized function for multiple bit depths).
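To illustrate the point about borders: with a texture object whose address mode is set to clamp, out-of-range fetches return the nearest edge texel, so the kernel body needs no border logic at all. A minimal sketch, assuming a pitch-linear `float` image already on the device; the names (`gauss3x3Tex`, `dImg`, `pitch`) and the separable 1-2-1 weights are illustrative, not from the paper:

```cuda
// 3x3 Gaussian fetched through a texture object. cudaAddressModeClamp
// handles the image borders in hardware, so there is no edge code here.
__global__ void gauss3x3Tex(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    const float k[3] = {0.25f, 0.5f, 0.25f};   // separable 1-2-1 weights
    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc += k[dx + 1] * k[dy + 1]
                 * tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);
    out[y * w + x] = acc;
}

// Host-side setup (sketch): wrap existing pitch-linear device memory
// in a texture object with clamped addressing and point sampling.
void makeTexture(float* dImg, size_t pitch, int w, int h,
                 cudaTextureObject_t* tex)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr       = dImg;
    res.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width        = w;
    res.res.pitch2D.height       = h;
    res.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode     = cudaFilterModePoint;
    td.readMode       = cudaReadModeElementType;

    cudaCreateTextureObject(tex, &res, &td, NULL);
}
```

The same kernel also works for multiple bit depths if templatized on the fetched type, which is part of why this style stays compact.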
I would like to use code based on texture memory and compare it with my code. Where could I find code like this?
Source code for the implementations from the mentioned paper is at