In what cases should I use texture memory (instead of direct global memory access)?
I’ve measured the performance of reading global memory through textures in several cases.
However, it was slower than direct global memory access in every case.
I wonder which cases texture memory is actually suited for.
Also, does anyone know the detailed behavior of texture caching?
I understand that texture data that is spatially close in 2D gets cached when the texture memory space is accessed.
But how close?
Who shares the same texture cache?
Threads in the same warp, same block, or all threads?
Thanks in advance, and I’m sorry for my poor English.
I never bought this. Turning two 2D indices (which you need in order to address a 2D tex) into one 1D index is a single MAD (i = x + y*xdim).
If you’re doing so many of those MADs that it might become measurable, you’re probably doing just as many memory reads afterwards, at which point your kernel gets bandwidth bound. The exception is when you’re hitting the L1 tex cache, but at that point you’re benefiting from caching, not indexing.
A single MAD before a memory read is nothing.
If you need the wrapping behavior, there’s just an additional modulo. Clamping behavior - min and max functions.
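To make the point concrete, here is a minimal sketch of doing the addressing by hand. The helper names (flatIndex, wrapCoord, clampCoord, fetchWrap, fetchClamp) are made up for illustration and are not part of any CUDA API. It is written as plain C++, assuming a row-major float array; the same functions could presumably be marked `__host__ __device__` and called from a kernel.

```cpp
#include <algorithm>

// Hypothetical helpers for illustration only, not a CUDA API.

// Row-major flat index: the single multiply-add (i = x + y*xdim).
inline int flatIndex(int x, int y, int xdim) { return x + y * xdim; }

// Wrap addressing: a modulo that also handles negative coordinates.
inline int wrapCoord(int c, int dim) { return ((c % dim) + dim) % dim; }

// Clamp addressing: min and max pin the coordinate to [0, dim - 1].
inline int clampCoord(int c, int dim) { return std::min(std::max(c, 0), dim - 1); }

// Read with wrap addressing on both axes.
inline float fetchWrap(const float* data, int x, int y, int xdim, int ydim) {
    return data[flatIndex(wrapCoord(x, xdim), wrapCoord(y, ydim), xdim)];
}

// Read with clamp addressing on both axes.
inline float fetchClamp(const float* data, int x, int y, int xdim, int ydim) {
    return data[flatIndex(clampCoord(x, xdim), clampCoord(y, ydim), xdim)];
}
```

Either way it is a handful of integer instructions in front of the memory read.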
Now, caching and tolerance for imperfect coalescing - that’s where textures are more advantageous. Free filtering might make sense if you’re doing the bilinear kind. 1D linear filtering is probably more trouble than it’s worth (only 256 steps between values and you only save like 5 arithmetic instructions).
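For a sense of what the "free" bilinear filtering replaces, here is a rough software version. This is only a sketch under some assumptions: the function name bilinear is made up, texel centers are assumed to sit at integer coordinates, out-of-range reads are clamped, and the weights are full floats rather than the coarse 256-step weights mentioned above, so results will not match the hardware bit for bit.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical software bilinear fetch from a row-major float array.
// Assumes texel centers at integer coordinates and clamp addressing.
inline float bilinear(const float* data, int xdim, int ydim, float x, float y) {
    const float fx = std::floor(x);
    const float fy = std::floor(y);
    const int x0 = static_cast<int>(fx);
    const int y0 = static_cast<int>(fy);
    const float a = x - fx;  // horizontal weight in [0, 1)
    const float b = y - fy;  // vertical weight in [0, 1)

    // Clamped read of a single texel.
    auto at = [&](int xi, int yi) {
        xi = std::min(std::max(xi, 0), xdim - 1);
        yi = std::min(std::max(yi, 0), ydim - 1);
        return data[xi + yi * xdim];
    };

    // Blend along x on two neighboring rows, then blend the rows along y.
    const float top = (1.0f - a) * at(x0, y0)     + a * at(x0 + 1, y0);
    const float bot = (1.0f - a) * at(x0, y0 + 1) + a * at(x0 + 1, y0 + 1);
    return (1.0f - b) * top + b * bot;
}
```

That is four reads plus a dozen or so arithmetic instructions per sample, which is why the bilinear case is the one where the hardware filter may be worth it.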
NVIDIA doesn’t publish the details, so all that anyone can do is guess.
The tex cache delivers its best bandwidth when the threads in each individual warp access values near each other in memory. The texture cache is too small to provide any meaningful temporal locality, and thread scheduling prevents spatial locality between threads in a block from contributing much.
If you want to read more of my musings on the texture cache, search the forums using Google: