When is it worth copying global to texture memory

Hi all,
I’m trying to convert some OpenGL shaders to CUDA. My algorithms need multiple passes/shaders/kernels, as the output of one pass is (most of the time) the input of the next pass.

Some of the algorithms read a single input pixel to generate one output pixel; others use multiple neighbouring input pixels to generate one output pixel.

The question is whether there’s any golden rule as to when the expense of copying from global memory to texture memory pays off (so that the texture cache is used).

Is there any point in copying the output from one kernel from global to texture memory if the next kernel:

  • accesses only one pixel
  • accesses 3 neighbouring pixels
  • accesses 9 pixels (I guess here texturing pays off)


Hi Mark,

For “linear” arrays of data you can bind to a texture without any overhead of copying. This allows you to use the texture cache without any up front overhead. I’m not familiar with “cuda arrays”, but I’ve read that they allow two and three dimensional spatial locality in the cache (as well as interpolation, and other features) but have the downside that you must memcpy into and out of them.
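To make the "bind without copying" point concrete, here is a minimal sketch using the legacy texture-reference API of that era (names like `texInput` and `scale` are made up for illustration); the bind call just points the texture unit at an existing device allocation, so there is no memcpy:

```cuda
#include <cuda_runtime.h>

// Texture reference bound to linear device memory (legacy API).
texture<float4, 1, cudaReadModeElementType> texInput;

__global__ void scale(float4 *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch(texInput, i);  // cached read from linear memory
        out[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);
    }
}

void launch(float4 *d_in, float4 *d_out, float s, int n)
{
    // Bind the existing device allocation -- no copy involved.
    cudaBindTexture(0, texInput, d_in, n * sizeof(float4));
    scale<<<(n + 255) / 256, 256>>>(d_out, s, n);
    cudaUnbindTexture(texInput);
}
```

The only overhead is the bind/unbind calls themselves, which is why this is attractive between kernels in a multi-pass pipeline.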

See this thread, and also look at the CUDA SDK samples for the special texture declaration and binding syntax.

In my experience, in order to get maximum performance, it is necessary to implement the same kernel several different ways and benchmark each one. In my application on the Tesla C870 I get the most bandwidth by using linear textures of float4 elements.


The answer to all 3 questions boils down to one thing: can you coalesce your global memory reads? If you can, then you will probably see no benefit from using texture reads. Shared memory can be a big help here: load a block into shared memory, and then neighboring threads can read the neighboring pixels from there, saving many global memory reads. I believe a few of the SDK examples show how to do this.
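A minimal sketch of that shared-memory pattern (all names are illustrative; for brevity it assumes `n` is a multiple of the tile size): each global element is loaded once with a coalesced read, then neighbouring threads reuse it from shared memory.

```cuda
#define TILE 256

// 1D 3-point stencil: stage a tile plus a one-element halo on each side
// in shared memory, so the three neighbour reads hit shared memory.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];          // +2 for the halo
    int g = blockIdx.x * TILE + threadIdx.x;  // global index
    int s = threadIdx.x + 1;                  // shared-memory index

    tile[s] = in[g];                          // coalesced load
    if (threadIdx.x == 0)                     // left halo (clamped at edge)
        tile[0] = (g > 0) ? in[g - 1] : in[g];
    if (threadIdx.x == TILE - 1)              // right halo (clamped at edge)
        tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : in[g];
    __syncthreads();

    out[g] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0f;
}
```

Without the shared-memory staging, each input element would be fetched from global memory three times (once per neighbouring thread) instead of once.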

If you cannot coalesce all of your reads (maybe because some are random or because it would require a lot of code logic to manage the coalescing) then you will benefit from using the texture cache. The key to obtaining the most performance out of the texture cache is to have all threads in a warp access spatially local values in the texture (1D locality if using tex1Dfetch and device memory, or 2D locality if using tex2D and array memory).
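For the 2D-locality case, a rough sketch of the `tex2D`/array route (again using the legacy texture-reference API; `texImg`, `blur3x3`, and `setup` are assumed names). The copy into the `cudaArray` is the up-front cost the original poster asked about; after that, warps touching a 2D neighbourhood hit the spatially-aware texture cache:

```cuda
texture<float, 2, cudaReadModeElementType> texImg;

// 3x3 box filter: 9 spatially-local fetches per output pixel.
__global__ void blur3x3(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tex2D(texImg, x + dx + 0.5f, y + dy + 0.5f);
    out[y * w + x] = sum / 9.0f;
}

void setup(const float *d_src, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpyToArray(arr, 0, 0, d_src, w * h * sizeof(float),
                      cudaMemcpyDeviceToDevice);  // the up-front copy
    cudaBindTextureToArray(texImg, arr);
}
```

Whether that copy pays off depends on how many cached neighbourhood fetches amortize it, which is why benchmarking both variants (as suggested above) is the safest approach.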