Texture Memory vs Shared Memory


I implemented a Gaussian filter in CUDA (3x3 kernel). I’ve done the following versions:

  1. Non-separable mode using shared memory
  2. Separable mode using shared memory

I have not tried a version that uses only global memory, as I’m sure it would be very slow.

Option 2 runs much faster.
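For readers comparing the two variants, here is a rough sketch of what the horizontal pass of option 2 might look like. It is not the code from this post: the tile sizes, the (1, 2, 1) / 4 weights, the replicated border and the names `gaussRow` and `blurRows` are all assumptions made for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_W 32
#define TILE_H 8

// Assumed 1D weights (1, 2, 1) / 4; their outer product is the usual 3x3 Gaussian.
__constant__ float c_w[3] = {0.25f, 0.5f, 0.25f};

// Horizontal pass: each block stages a TILE_H x (TILE_W + 2) strip in shared memory.
__global__ void gaussRow(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_H][TILE_W + 2];

    int x  = blockIdx.x * TILE_W + threadIdx.x;
    int y  = blockIdx.y * TILE_H + threadIdx.y;
    int yc = min(y, height - 1);                  // keep loads in range in partial blocks
    const float* row = in + (size_t)yc * width;

    // Centre pixel; x is clamped so the image border is replicated.
    tile[threadIdx.y][threadIdx.x + 1] = row[min(x, width - 1)];
    if (threadIdx.x == 0) {                       // one thread per row loads both halo pixels
        tile[threadIdx.y][0]          = row[max(x - 1, 0)];
        tile[threadIdx.y][TILE_W + 1] = row[min(x + TILE_W, width - 1)];
    }
    __syncthreads();

    if (x < width && y < height) {
        float s = 0.0f;
        for (int k = 0; k < 3; ++k)               // taps at x-1, x, x+1
            s += c_w[k] * tile[threadIdx.y][threadIdx.x + k];
        out[(size_t)y * width + x] = s;
    }
}

// Hypothetical host wrapper: copies the image to the GPU, runs the row pass, copies back.
std::vector<float> blurRows(const std::vector<float>& img, int width, int height)
{
    size_t bytes = img.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_W, TILE_H);
    dim3 grid((width + TILE_W - 1) / TILE_W, (height + TILE_H - 1) / TILE_H);
    gaussRow<<<grid, block>>>(dIn, dOut, width, height);

    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The vertical pass would be the same idea with the roles of x and y swapped. The separable form needs 3 + 3 taps per pixel instead of 9, which is consistent with option 2 being the faster one.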

I have read that texture memory can be faster, but I have also read the opposite. My question is:

Could a texture-memory version run faster than option 2 for my project?

I’m using a GeForce GTX970.

I hope you can help me …

I’m not sure about the Maxwell architecture, but for the Kepler architecture see
and the paper
‘Communication-Minimizing 2D Convolution in GPU Registers’ (http://parlab.eecs.berkeley.edu/publication/899).

For compute capability >= 3.0, we tend to prefer texture memory (using strategies from the mentioned paper) for our image processing routines, which operate on 2-D pitch-linear memory. The code is quicker to write, needs no explicit handling of the image borders, and ends up more compact (especially when implementing a templatized function for multiple bit depths).
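As an illustration of the border-handling point, here is a minimal sketch of a 3x3 Gaussian that reads from a texture object bound to pitch-linear memory. It is a plain non-separable version, not the register-based strategy from the paper, and the names `gauss3x3TexKernel` and `gauss3x3Texture`, the (1 2 1; 2 4 2; 1 2 1) / 16 weights and the 16x16 block size are assumptions. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <vector>

// Every fetch goes through the texture unit; clamp addressing replicates the
// border, so the kernel contains no explicit edge cases.
__global__ void gauss3x3TexKernel(cudaTextureObject_t tex, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float s = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            float w = (2 - abs(dx)) * (2 - abs(dy)) / 16.0f;   // 1-2-1 outer product
            // +0.5f addresses the texel centre (unnormalized coordinates).
            s += w * tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);
        }
    out[(size_t)y * width + x] = s;
}

// Hypothetical host wrapper around the kernel above.
std::vector<float> gauss3x3Texture(const std::vector<float>& img, int width, int height)
{
    size_t rowBytes = width * sizeof(float);
    float* dIn = nullptr;
    size_t pitch = 0;
    cudaMallocPitch(&dIn, &pitch, rowBytes, height);           // pitch-linear input
    cudaMemcpy2D(dIn, pitch, img.data(), rowBytes, rowBytes, height,
                 cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType                  = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr       = dIn;
    res.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width        = width;
    res.res.pitch2D.height       = height;
    res.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc td = {};
    td.addressMode[0]   = cudaAddressModeClamp;                // replicate border in x
    td.addressMode[1]   = cudaAddressModeClamp;                // replicate border in y
    td.filterMode       = cudaFilterModePoint;                 // no interpolation
    td.readMode         = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* dOut = nullptr;
    cudaMalloc(&dOut, img.size() * sizeof(float));
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    gauss3x3TexKernel<<<grid, block>>>(tex, dOut, width, height);

    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, img.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Supporting another bit depth would mostly mean changing the channel descriptor and the `tex2D<T>` fetch type, which is where the compactness of a templatized version comes from. Whether this beats a tuned shared-memory kernel on a given GPU still has to be measured.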

I would like to try a texture-memory implementation and compare it with my own code. Where could I find code like this?

Source code for the implementations from the mentioned paper is at