Is coalesced access important for texture memory?

From the tutorials I have read, it seems coalescing is always associated with global memory.

Texture memory is cached, while global memory is not. So does coalesced access matter for texture memory reads/writes the way it does for global memory?


No, but it is useful to have some local coherence in the texture accesses. Within a warp, try to access elements that are close together, since the texture cache is built to exploit that: a load through the texture cache probably also fetches the nearest neighbors, so when you want to read those too, they are already in the cache.
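To make "close together" a bit more concrete, here is a rough host-side sketch in plain C++ (no GPU needed). It counts how many aligned memory lines one warp's reads fall into. The warp size of 32, the 128-byte line size, and the helper names `warp_addresses` and `lines_touched` are assumptions for illustration only. It says nothing about how the real texture cache is organized, which NVIDIA does not document in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed parameters for illustration; real values depend on the GPU generation.
constexpr std::size_t kWarpSize  = 32;   // threads per warp
constexpr std::size_t kLineBytes = 128;  // assumed size of one aligned memory/cache line

// Byte addresses for one warp where thread t reads the float at index offset + t * stride.
std::vector<std::uint64_t> warp_addresses(std::size_t offset, std::size_t stride) {
    std::vector<std::uint64_t> addrs;
    addrs.reserve(kWarpSize);
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        addrs.push_back((offset + t * stride) * sizeof(float));
    }
    return addrs;
}

// Number of distinct aligned lines the warp's reads fall into.
// Fewer lines means more locality: neighbors arrive together with the first fetch.
std::size_t lines_touched(const std::vector<std::uint64_t>& byte_addresses) {
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : byte_addresses) {
        // The model assumes naturally aligned 4-byte reads, so no read straddles two lines.
        assert(a % sizeof(float) == 0);
        lines.insert(a / kLineBytes);
    }
    return lines.size();
}
```

In this model, `lines_touched(warp_addresses(0, 1))` is 1 (perfectly aligned, unit stride), shifting the start by one element gives 2, and a stride of 32 elements gives 32. The shifted case is the kind of "almost coalesced" pattern where a cache can plausibly help, because the extra line is likely to be reused by a neighboring warp.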

Texture memory is “physically” the same as global memory, right? So even with caching, the first fetch of data from a texture also benefits from coalescing, is that correct?

I am not sure anybody outside of NVIDIA knows for sure, but I would guess so. Also, 2D textures have 2D locality, so there might be some other trick/technique employed.

Anyhow, treating textures as a black box, I would like to quote Mr Anderson: textures are very useful when you have almost-coalesced accesses. They should also be useful for random access.

Btw, if your accesses are always coalesced, you won't get any benefit from texture memory. It is better to leave the data in global memory. This was discussed some time back in this forum.

It's even more definite than that: if you can get all your accesses coalesced, you will get better performance using non-texture memory.

Interesting point… But why?

A “coalesced read from texture memory to shared memory” should not be worse than a “coalesced read from global memory to shared memory”, if not better, right?

Reading from textures requires a few extra registers for addressing and calling the texture unit. The extra register usage can change your occupancy, and the total throughput through the texture unit is slightly less than that of a coalesced read.

Given these considerations, if you can coalesce, it is advisable to do so instead of using the texture read, especially when performing multiple reads within a thread. The only exception is when reading 128-bit types, where textures are faster than coalesced reads.
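As a back-of-the-envelope illustration of the occupancy point, here is a simplified host-side sketch in C++. The `SmLimits` struct, the `resident_threads` helper, and all the numbers used with it are hypothetical. Real hardware also rounds register allocation to some granularity and has shared-memory and block-count limits, which this ignores.

```cpp
#include <cassert>

// Hypothetical per-multiprocessor limits; look up the real ones for your GPU.
struct SmLimits {
    int registers;    // 32-bit registers available per multiprocessor
    int max_threads;  // maximum resident threads per multiprocessor
};

// Threads that can be resident at once when only the register file and the
// thread limit constrain how many whole blocks fit on one multiprocessor.
int resident_threads(const SmLimits& sm, int regs_per_thread, int threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    const int blocks_by_regs    = sm.registers / (regs_per_thread * threads_per_block);
    const int blocks_by_threads = sm.max_threads / threads_per_block;
    const int blocks = blocks_by_regs < blocks_by_threads ? blocks_by_regs : blocks_by_threads;
    return blocks * threads_per_block;
}
```

With assumed limits of 8192 registers and 768 threads, a 256-thread block at 10 registers per thread keeps 768 threads resident, while the same kernel at 12 registers per thread keeps only 512. Two extra registers are enough to drop one whole block in this made-up case, which is the kind of effect to check for with your own kernel's register count.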

That isn’t always true. There are apps that achieve a slightly higher memory throughput reading from textures (fully coalesced addressing, if it were done for gmem) and writing result to gmem.


Paulius, is there a general guideline for when this is the case (for instance, memory-bound kernels)? Otherwise I need to rewrite quite a lot of code to see whether this is true for my kernels.

That's interesting … do you know why?