When to use textures

I am trying to write an image processing library with CUDA. When is it best to use texture memory as opposed to global memory when writing image filters (Gaussian, Sobel, etc.)? I know that texture memory is cached, so should I always be using it? Why doesn't the convolutionSeparable sample use texture memory?

If I were to use texture memory for all my images (8-bit int mono, 16-bit int mono, 32-bit float mono), how would I chain filters without recopying the data? For example, I want to compute a gradient image, keep it resident on the device, and then use it as input to another filter (there is no point in copying it back to system memory only to copy it back to the device again). The gradient image would have to be generated into global memory, so how would I treat that same memory as texture memory in the next pass? Would I have to do a device-to-device transfer to accomplish this? Is that slow? I used to do this all the time in Direct3D without an explicit copy, so would this device-to-device transfer in CUDA make my filter chain run slower than Direct3D?

A coalesced global memory read will be faster than a texture read. Only use textures when you really need the cache because you can't coalesce your reads.

If you do need a 2D texture read/modify, then you need to write to global memory and do a device-to-device copy. There are a few other recent threads on this topic. Another option (if you don't need a 2D cache) is to use a 1D texture bound to device memory and write directly back into it. As long as each thread only reads the same values it writes, there are no race conditions involving memory accesses.
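A minimal sketch of that second pattern, using the legacy texture reference API that was current at the time of this thread (it was deprecated and later removed in CUDA 12 in favour of texture objects). The kernel and function names are illustrative; the point is that the texture is bound to the same linear device buffer the kernel writes into, and each thread reads only the element it writes:

```cuda
#include <cuda_runtime.h>

// Legacy texture reference bound to linear device memory.
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void scaleInPlace(float *d_data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread reads only the element it writes back,
        // so the read-modify-write has no inter-thread race.
        float v = tex1Dfetch(texRef, i);
        d_data[i] = v * s;
    }
}

void runInPlace(float *d_data, int n)
{
    cudaBindTexture(0, texRef, d_data, n * sizeof(float));
    scaleInPlace<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaUnbindTexture(texRef);
}
```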

If your access pattern is predictable, you might get the best efficiency with coalesced global memory reads (of size float2 or int2) and shared memory. If there is some randomness (but locality) in your reads, textures might be the way to go.

A device-to-device copy is very fast; I doubt it will ever become the bottleneck of your algorithm. Still, it does take some time, of course, and it can be avoided for things like convolution over a small area.
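To make the chaining step concrete: a sketch of staging a gradient image (already in global memory from pass A) into the cudaArray backing pass B's 2D texture, entirely on the device. The function and buffer names here are hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical chaining step: pass A wrote d_grad to global memory.
// Copy it into the cudaArray backing pass B's 2D texture with a
// device-to-device transfer -- no host round trip involved.
void stageForTexture(cudaArray *texArray, const float *d_grad,
                     int width, int height)
{
    cudaMemcpy2DToArray(texArray, 0, 0, d_grad,
                        width * sizeof(float),          // source pitch
                        width * sizeof(float), height,  // row bytes, rows
                        cudaMemcpyDeviceToDevice);
}
```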

This is not an answer to Ryan's question; in fact, it is my own question. I thought I would add it in a relevant place rather than start a new thread.

Here is my noob question:

I have never used textures before. All I know is that they are cached and hence could be faster if used in the right way.

One of my kernels accesses an array in global memory as read-only (the array was generated by a previous kernel). Each element of this array is accessed only once, with no repetition. In such a case, is there any benefit to using a texture instead of global memory? Is there a way for the host application to move these arrays into "constant" memory somehow?

Thanks for your time.

Yes, there will be an improvement in data throughput using tex1Dfetch bound to global memory, as long as threads in a warp access nearby elements in the texture.

But then, I access each element only once. Doesn't the texture cache work like a CPU cache, in the sense that the first access comes from main memory and only subsequent accesses hit the cache?

So I thought that if I access each element only once, there is no benefit from the cache (unless one cache miss fetches a whole lot of other neighbouring elements).

And if you could clarify my second question, on moving "global memory" to "constant memory", that would be great!

Try this for copying data from global memory space into constant memory space (const_name should be predeclared with an appropriate size): cudaMemcpyToSymbol( const_name, gmem_addr, nBytes, 0, cudaMemcpyDeviceToDevice );
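In context, that call looks like the sketch below. The array name and size are placeholders; the destination must be a file-scope __constant__ declaration (constant memory is limited to 64 KB in total). Note that current CUDA versions take the symbol itself rather than its quoted name — the string form was removed in CUDA 5.0:

```cuda
#include <cuda_runtime.h>

// Constant-memory destination, declared at file scope with a fixed size
// (placeholder name and size for illustration).
__constant__ float const_name[1024];

// Copy nBytes from an existing device buffer into constant memory.
void loadConstants(const float *gmem_addr, size_t nBytes)
{
    cudaMemcpyToSymbol(const_name, gmem_addr, nBytes, 0,
                       cudaMemcpyDeviceToDevice);
}
```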

As for your first question: if all your reads are coalesced, then you will probably gain nothing from using textures.

I have never used constant memory or textures before, so I am going to try tomorrow.

Thanks for all your replies.