Benefits of Texture Memory: couldn't use them...


I was experimenting with device memory reads through texture fetching. In the programming guide (section 5.4), one of the listed benefits is that "they are not subject to the constraints on memory access patterns that global or constant memory reads must respect in order to get good performance". I was testing with the bandwidth-check .cu file provided by MisterAnderson (NVIDIA forum topic) and changed the access pattern from

const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

to

const unsigned int idx = threadIdx.x * blockDim.x + blockIdx.x;

hoping that the access would stay almost as fast as the programming guide promises. Instead, the read-only texture bandwidth dropped from 30 GiB/s to 8 GiB/s (8800 GTS). I saw similar behavior for the other data types and read modes.

Did I misunderstand something in the programming guide? Why do device memory reads through texture fetching behave like normal device memory reads if they are not subject to any constraints?


You don't have the constraints of memory banks and coalescing, but if you are going to read N values that are not close together (and are not reused), it will take about the same time as global memory.

A texture fetch goes through a localized cache, so random reads of values that are close together will benefit from the cache.

Okay, I understand. But suppose someone wants to apply a filter to an image that is stored, say, column-wise. Then the filter has to touch areas that are not close together at all. How does CUDA handle that?

thanks for the reply.

For image filters I usually use 2D textures, which are pretty fast when 2D locality is given. Filters also use the same data for different calculations, so you profit from the texture cache as well.
You have to copy the image data to a cudaArray first, though, to be able to use 2D textures.
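In case it helps, here is a minimal sketch of that setup using the texture reference API from the 8800-era toolkit. The kernel name blur3x3, the run helper, and the 16x16 launch configuration are made up for illustration; error checking is omitted for brevity:

```cuda
// 2D texture reference; elements are read back unmodified (no normalization).
texture<float, 2, cudaReadModeElementType> texImage;

// Simple 3x3 box filter. Out-of-range coordinates are handled for free by
// the texture's default clamp addressing mode.
__global__ void blur3x3(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tex2D(texImage, x + dx, y + dy);  // 2D-cached fetch
    out[y * w + x] = sum / 9.0f;
}

void run(const float *h_img, float *d_out, int w, int h) {
    // Copy the image into a cudaArray and bind the 2D texture to it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpyToArray(arr, 0, 0, h_img, w * h * sizeof(float),
                      cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texImage, arr);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    blur3x3<<<grid, block>>>(d_out, w, h);

    cudaUnbindTexture(texImage);
    cudaFreeArray(arr);
}
```

The point is that each 16x16 block touches an 18x18 neighborhood of the image, so its fetches land in a compact 2D region that the texture cache can serve.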

Hmmmm, okay. I was thinking of using texture memory for linear algebra routines, such as matrix-vector multiplication. But I think 2D locality is not given there, so caching does not really help performance.

thx for the replies

Think of the cache as a way to read “almost coalesced” values. If all 32 threads in a warp read values close to each other in the texture (where close can be 1D or 2D depending on the texture type), then you will achieve maximum throughput as if you had a coalesced read. For particular linear algebra routines, this might be convenient if coalescing is difficult.
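As an example of that, here is a sketch of a matrix-vector product y = A*x. Storing A column-major (a choice made here for illustration) makes the matrix reads coalesced, while the vector x, which every thread reuses, is fetched through a 1D texture so the cache serves the repeated reads:

```cuda
// Legacy texture reference bound to the linear device memory holding x.
texture<float, 1, cudaReadModeElementType> texVec;

// y = A*x with A stored column-major: at each j, consecutive threads read
// consecutive elements of A (coalesced), and every thread reads the same
// x[j], which the texture cache serves cheaply.
__global__ void matVec(const float *A, float *y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < cols; ++j)
        sum += A[j * rows + row] * tex1Dfetch(texVec, j);
    y[row] = sum;
}

// Host side: for 1D fetches you can bind plain linear memory directly,
// no cudaArray needed.
//   cudaBindTexture(0, texVec, d_x, cols * sizeof(float));
//   matVec<<<(rows + 255) / 256, 256>>>(d_A, d_y, rows, cols);
//   cudaUnbindTexture(texVec);
```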

Well, the two expressions are totally different.

Say you launch 5 blocks of 512 threads, with each thread corresponding to one element. The total number of elements is 5*512 = 2560.

The first way of calculating "idx" is straightforward; I don't need to explain anything there.

The second case: consider thread 511 in the block with blockIdx.x = 4. The expression evaluates to 511*512 + 4 = 261636, which is far outside the 2560 elements. That index is totally meaningless.

That is probably why you are seeing strange bandwidth numbers.