I find that textures are not explained very well in the Programming Guide/Reference Manual.

My understanding:

A Texture Reference must be declared at file scope:
texture<Type, Dim, ReadMode> texRef;

eg. texture<Float, 2, cudaReadModeElementType> tex1;
declares a texture reference called tex1 which is 2D, returns floats, and is accessed using non-normalised coordinates.

We will assume that we want to use the texture to access a Cuda Array, so in main() we must declare:
cudaArray *cuA;

Now we need a Channel Format Descriptor
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
I am not very clear why this is necessary, and what it does. I also do not know what int x, int y, int z, and int w refer to. I think they are something like the number of bits in each dimension returned when the texture is accessed, but then it doesn’t make sense why in the examples I have seen, even when a 2D texture is being used, only ‘x’ is non-zero.

We need to allocate memory on the GPU our Cuda Array now:
cudaMallocArray( &cuA, &channelDesc, 3sizeof(float), 1sizeof(float) );
I believe this should allocate a 3 x 1 array of floats

We need to put something into our Cuda Array:
cudaMemcpyToArray( cuA, 0, 0, h_data, 3*sizeof(float), cudaMemcpyHostToDevice);
This copies 3 floats from the h_data array to our Cuda array ‘cuA’

We need to bind our texture (tex1) to the Cuda Array, so that we can access the Cuda Array through the texture:
cudaBindTextureToArray(tex1, cuA, channelDesc);

We can now access our Cuda Array through the texture (which will hopefully improve performance due to caching) like this:
*(dat+ threadIdx.x)=tex2D(tex1, threadIdx.x, 0);
(where ‘dat’ is a pointer to some memory allocated on the device for floats)
This accesses cuA[threadIdx.x][0].

I believe it is not possible to alter textures from the kernel, so this implies that binding the memory (in our case the Cuda Array cuA) to a texture makes a copy of the memory, so if we made a change to cuA it would not alter the value returned by tex2D unless the texture is re-bound to cuA.

I believe that one does not have to worry about coalescing memory reads when reading a texture.

I have read that texture reads are fastest when one is accessing cells of the array that are close together (‘2D spatial locality’), which I presume has something to do with how the cache works. I understand this to mean that if I am accessing [a][b] and [a+1][b], say, I will probably get better speed that if I accessed [a+54][b] instead.

The Texture Cache is 16KB per processor. I assume that this is a different 16KB to the Shared Memory, so using the Texture Cache does not reduce available Shared Memory, so it is a good idea to use Textures for memory that does not need to be modified within a kernel to leave more Shared Memory for memory that does need to be modified, while possibly reducing accesses to global memory.


Binding does not copy the data to a new location. The reason you cannot alter a texture is because the cudaArray storage format is non-linear. To speed up reads with 2D or 3D spatial locality, cudaArrays are stored using some form of space-filling curve. See:


for an example of such a data ordering. The problem is that whatever curve NVIDIA uses is not documented, and is subject to change in different hardware. So you cannot translate from physical (x,y) coordinates to a linear offset in the array yourself, and instead must rely on the texture hardware to do it for you (which only works in the read direction).

The exception to this are 1D textures bound to linear memory (rather than a CUDA array). You can write directly to the linear memory in your kernel, although then you will potentially have a cache-coherency problem, as there is no mechanism to expire cache lines in the texture cache when you alter global memory this way. Still, for some problems, this is acceptable and people do in fact do it.

(Edit: I should point out there is one obvious way to expire the texture cache: Let your kernel finish. On the next kernel launch, the texture cache is flushed and repopulated with the updated values from global memory.)

Rebinding textures is relatively fast, and it is a handy tool for switching between data sets in subsequent calls to a kernel. For example, in one of my programs I load 4 different cudaArrays into memory, and rebind the array I want to a global texture reference just before calling the kernel.

Yes, see the Z-order comment above. A “linear” read of a CUDA array by the cache hardware will obtain cells which are near each other. Given the width of the memory bus, it has to make large consecutive reads to get good bandwidth.

It is important to keep in mind that if you are doing completely random reads in a large array, texture memory will be no faster than global memory. The element you want is unlikely to already be in the cache, and the space filling curve will not help you at all.

Yes, the texture cache is entirely separate from shared memory. Similarly, the constant memory cache is also separate from both of them.

Read mode is not about accessing coordinates but how to format returning data:

Setting the access using normalized or non-normalized coordinates is made later (and not necessarily at file scope) by something like:

tex1.normalized = 1; //for true, 0 for false

(page 49 in the guide)

That’s one step you seem to have forgotten - managing the texture reference. This is where you set filtering and addressing modes.