About texture cache and spatial locality

As documented in the programming guide, the texture cache is optimized for 2D spatial locality. However, I have never seen it officially mentioned whether this extends to 3D textures and 3D spatial locality, or whether 3D textures do not get this benefit.

Hello up there at Michigan Tech. I got my undergrad degree there :)

There wouldn’t be much point to 3D textures if they didn’t offer 3D spatial locality in their accesses. This has been mentioned semi-officially in some forum posts before, but not recently. You could always write a microbenchmark to verify: just a simple kernel that copies data out of a 3D texture into global memory. Run it with threads reading along each of the 3 directions and compare the timings.
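For concreteness, a rough sketch of such a kernel might look like the following (texVol, copy3d, and the launch geometry are just illustrative assumptions; the 3D texture would have to be bound to a cudaArray on the host):

texture<float, 3, cudaReadModeElementType> texVol;   // assumed bound to a 3D cudaArray elsewhere

__global__ void copy3d(float *out, int n, int dir)
{
	// each thread holds two coordinates fixed and walks the texture along the third
	int a = blockIdx.x * blockDim.x + threadIdx.x;
	int b = blockIdx.y * blockDim.y + threadIdx.y;
	if (a >= n || b >= n) return;

	float sum = 0.0f;
	for (int i = 0; i < n; ++i) {
		float x, y, z;
		if      (dir == 0) { x = (float)i; y = (float)a; z = (float)b; }  // walk along X
		else if (dir == 1) { x = (float)a; y = (float)i; z = (float)b; }  // walk along Y
		else               { x = (float)a; y = (float)b; z = (float)i; }  // walk along Z
		sum += tex3D(texVol, x + 0.5f, y + 0.5f, z + 0.5f);
	}
	out[b * n + a] = sum;   // one global write per thread, identical for all three directions
}

Launching this three times with dir = 0, 1, 2 and comparing the timings should show whether the cache treats the three directions symmetrically.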

And I’ll point out to clarify that only textures in 2D cudaArrays are laid out for 2D spatial locality. Textures bound to global memory and read with tex1Dfetch only have 1D locality.
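To make the distinction concrete, the two binding paths look roughly like this with the texture-reference API of that CUDA generation (the names and sizes are made up for illustration):

texture<float, 2, cudaReadModeElementType> texArr;   // read in kernels with tex2D: 2D locality
texture<float, 1, cudaReadModeElementType> texLin;   // read in kernels with tex1Dfetch: 1D locality

void bind_both(int w, int h)
{
	// Path 1: a 2D cudaArray, stored in a layout optimized for 2D spatial locality
	cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
	cudaArray *arr;
	cudaMallocArray(&arr, &desc, w, h);
	cudaBindTextureToArray(texArr, arr, desc);

	// Path 2: plain linear (global) memory, row-major, so only 1D locality
	float *d_lin;
	cudaMalloc((void**)&d_lin, w * h * sizeof(float));
	cudaBindTexture(0, texLin, d_lin, w * h * sizeof(float));
}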

Thanks for the greetings. The Bioinformatics program at Iowa State interests me quite a bit, so I might apply there in the future :D

And thanks for the quick reply as well. I have a second/third question regarding this (I guess I should really make a benchmark of this myself anyway…). The programming guide says that spatial locality refers to texture addresses that are close together. My question is: what is the definition of ‘close’, or what range does the texture cache consider when caching the data? One non-normalized unit? Enough range, depending on the dimensionality, to cover at least 64 neighbors for the whole warp?

The other question: what if, for example, I have two threads within a warp that access two texture memory locations that are not spatially close (let’s say thread1 reads memory location (a1,a2) and thread2 reads (b1,b2)), but each of those threads successively reads memory locations that are spatially close to its own previous read (thread1 now accesses (a1+1,a2) and thread2 accesses (b1+1,b2) on a second read, and so on)? Would they still benefit from the cache, or would it be considered a cache miss?

I wouldn’t know. I’m in the physics dept.

I don’t really know for sure. This is one benchmark I keep meaning to write but have never gotten around to. This is just my intuition speaking, but I would think “close” means that (speaking of 1D textures here) threads in a warp access elements within a 128-element-wide window. Maybe a little bigger.

This is one thing I have benchmarked. Under ideal conditions, the reads of a1+1 and b1+1 later in the execution can be cache hits. In most cases, they rarely are. And you can calculate why:

On GTX 280, we can have 32 warps resident on one MP. Let’s say each thread reads a float4: that is 32 warps × 32 threads × 16 bytes = 16 KiB of data! With an 8 KiB texture cache and a 0th-order approximation of round-robin scheduling, any amount of variability in the texture coordinates read among the running warps will result in the values read by one warp being flushed long before that warp comes back to do another read. I believe this is the source of the comment in the programming guide that, to optimize texture reads, one should make the reads within a warp “close”. For the same reason, I like to think of the texture cache as an “efficient uncoalesced memory reader”.

One other thing of note about the texturing hardware on the GPU is that it can only serve so many texture reads per second (regardless of their size). Thus (a case pertinent to my code), three float reads from three different textures are much slower than a single float4 read from one texture, despite the 25% bandwidth premium.
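Roughly, the two access patterns look like this (the texture names and gather kernels here are only an illustration, not my actual code):

texture<float,  1, cudaReadModeElementType> texX, texY, texZ;   // three separate float textures
texture<float4, 1, cudaReadModeElementType> texXYZ;             // one float4 texture: xyz plus a dummy w

__global__ void gather3(float *out, const int *idx, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= n) return;
	int j = idx[i];
	// three fetches per element: three hits on the "reads per second" budget
	out[i] = tex1Dfetch(texX, j) + tex1Dfetch(texY, j) + tex1Dfetch(texZ, j);
}

__global__ void gather4(float *out, const int *idx, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= n) return;
	// one fetch per element: 16 bytes instead of 12 (the 25% premium), but a single read
	float4 v = tex1Dfetch(texXYZ, idx[i]);
	out[i] = v.x + v.y + v.z;   // v.w is padding
}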

Could this be related to the fact that the PTX manual says the ‘tex’ instruction always returns four 32-bit values? If so, then reading single floats would waste a terrible lot of bandwidth.

Hi, I’m starting with CUDA now and would like to better understand the performance bottlenecks, in particular with textures.

This is so; however, if the kernel will be reading sequential data, it might be possible to rearrange the computation so that sequential threads read from sequential addresses. Otherwise, it is possible to read the data ahead into shared memory (again letting sequential threads do the loading).
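A rough sketch of the shared-memory staging idea (the tile size and the scrambled access pattern are just placeholders):

#define TILE 256   // assumes the kernel is launched with TILE threads per block

__global__ void staged_read(const float *in, float *out, int n)
{
	__shared__ float tile[TILE];

	int base = blockIdx.x * TILE;
	int tid  = threadIdx.x;

	// coalesced load: thread k reads element base + k
	if (base + tid < n)
		tile[tid] = in[base + tid];
	__syncthreads();

	// each thread may now read the tile in whatever irregular order the computation
	// needs, without touching global memory again
	int scrambled = (tid * 37) % TILE;   // stand-in for an irregular pattern
	if (base + scrambled < n)
		out[base + tid] = tile[scrambled];
}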

I figured out that when each warp reads sequential (closely located) data, it does not matter how distant the addresses are across different warps – it performs very well (about 20 cycles per warp) independent of that, given maximal occupancy of the MP.

In more detail, I conducted the following experiment:

    - a linear texture bound to linear memory of size 32 MB
    - different warps read from random locations
    - threads in a warp read different locations
    - varied: the number of active warps (occupancy) and how close the reads within a warp are
    - measured: cycles spent by 1 MP per call of tex1Dfetch, per warp

I have only one MP, so I can set up minimal occupancy by launching the kernel on a single block of 32 threads (one warp), and maximal occupancy with various other configurations.

The results I observed:

minimal occupancy, very distant reads: ~2700 cycles per warp;

maximal occupancy, very distant reads: ~1000 cycles per warp;

minimal occupancy, sequential reads: ~300 cycles per warp;

maximal occupancy, sequential reads: ~20 cycles per warp.

Here “very distant reads” means the threads of a warp read from locations more than 8K apart. The performance varies smoothly as you make the reads closer, and also with occupancy. I observed similar results with 2D textures.

I was wondering how this result agrees with the guide, which says 400-600 cycles to complete a read from global memory. There is not much math to hide the latency behind: only the address calculation, about 20 clock cycles per texture read operation. Also, in the worst case it takes 2700 cycles to read 32 floats from random locations, whereas I would have expected something like 32*400 = 12800 clock cycles. What do you think?

The device is a Quadro NVS 160M (compute capability 1.1, one multiprocessor). Here is the kernel with distant reads:

	// each thread starts at its own pseudo-random offset
	int delta_x = 45632*blockIdx.x + 12345*threadIdx.x;
	const int blockSize = blockDim.x*blockDim.y*blockDim.z;
	const int gridSize  = gridDim.x*gridDim.y;
	int x = delta_x;
	float R = 0;
	for (int i = 0; i < niter; ++i) {
		// hop to a new location, wrapped into the first 2^21 elements of the texture
		x = (x + 763 + 1234*blockSize*gridSize) & 0x1FFFFF;
		R = R + tex1Dfetch(texRef, x);
	}
	// accumulate into global memory so the reads are not optimized away
	O_data[threadIdx.x] = R;

The constants are very ad hoc; I’m not sure whether something periodic is happening in there, but the texture size is large (2^23 floats) and I tried some other numbers at random, with essentially similar results.

Cool. Looks like you’ve done the detailed experiment based on the range of reads in a single warp that I never got around to doing myself. Any chance you might post a full code so we can try it ourselves on other hardware?

I have no clue. Cycle counting on GPUs, especially with the memory subsystem involved, is so prone to complications that I never even attempt it. In particular, for your calculation here, we have absolutely no idea what the details of the memory controller are: it could be capable of processing many of those 32 float reads simultaneously.

A much better metric and what really matters in the end is how much bandwidth over the full GPU can be pumped through the tex cache for these various read patterns.

@Knedlik,

The cycles lost depend on:

  1. What is the frequency of the multi processor clock?
  2. What is the width and speed of the data-bus?

So 400-500 cycles is a very relative figure, possibly used by NVIDIA to frighten programmers
(I mean, to discourage them from bad programming practices).

Good experiments though… Thanks.

Now that cudaBindTexture2D exists, since CUDA 2.2, do 2D textures bound to global memory have 2D locality?

2D textures bound to linear memory still make use of the texture cache, but won’t have the same cache behaviour as textures bound to arrays (which are stored in a special layout to optimize spatial locality).

I haven’t measured this, but I would expect 2D textures bound to linear memory to have fast access for nearby texels on the same row (X), but slower accesses to neighbouring rows (Y).
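For reference, binding a 2D texture reference to pitched linear memory looks roughly like this (the names and sizes are illustrative):

texture<float, 2, cudaReadModeElementType> texLin2D;

void bind2d(int width, int height)
{
	float *d_ptr;
	size_t pitch;
	cudaMallocPitch((void**)&d_ptr, &pitch, width * sizeof(float), height);

	cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
	size_t offset = 0;
	cudaBindTexture2D(&offset, texLin2D, d_ptr, desc, width, height, pitch);
	// in a kernel: tex2D(texLin2D, x + 0.5f, y + 0.5f) -- the same fetch as for a
	// cudaArray-backed texture, but the underlying layout is plain row-major
}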

There are some things I have been asking myself about texture units:

If I am correct, they are separate units from the memory controller that manages global memory… and they are also fixed-function.
So, what is the correlation between global memory bandwidth and texture samples?

For example, if I read ONLY from textures, could the TUs become overloaded?
And if I read both from textures and from global memory, would the bandwidth be doubled, or is it limited by the memory bus?
If not, texture read + global write should be the best way to achieve maximum throughput…

And if I read from global memory in a fully coalesced way, is it still worthwhile to switch to textures?
That is, is coalesced access > texture cache?

After all, CUDA should have more detailed documentation :rolleyes:

Not really. The “texture unit” is really just a small cache for reading from global memory (OK, it has built-in fixed-function interpolation, but that is mainly for graphics). All reads through textures still come over the same global memory bus: there is only one connection from the GPU to the memory.

Texture units can become overloaded even if you read from both global memory and textures. The texture units can only supply so many “reads” per second, so if you make a lot of small reads you can hit this limit before running out of bandwidth in some cases. For example, it is almost always better to do one float4 texture read than four separate float texture reads.

It is the same bus, so you can’t achieve a higher bandwidth no matter how you slice it. There are only so many bits on the bus and so many clock ticks per second.

On modern hardware: coalesced access == texture cache reads (within fluctuations). If you go all the way back to compute 1.0 (8800 GTX / Tesla C870), then texture reads of 128-bit data types are 2x faster than coalesced reads of 128-bit data types. But that is a non-issue on anything newer.

Basically, the easiest way to think of the texture cache is as a way to efficiently read from global memory in an un-coalesced way.
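As an illustration of that view, compare the same irregular gather done directly from global memory and through a 1D texture bound to the same buffer (the names here are made up; texBuf would be bound to the buffer with cudaBindTexture):

texture<float, 1, cudaReadModeElementType> texBuf;

__global__ void gather_gmem(float *out, const float *in, const int *idx, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = in[idx[i]];   // irregular idx means uncoalesced global reads
}

__global__ void gather_tex(float *out, const int *idx, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = tex1Dfetch(texBuf, idx[i]);   // same pattern, but served through the tex cache
}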

OK, now that’s clear, thanks!

Could anyone clarify why this is? Why was there this difference between texture and global reads?

I’ve also been wondering what the reason is for texture memory not being used for matrix multiplication.

Any official information on the size of the texture cache in each Multiprocessor?

(it has been said here that it’s 8 KiB, but I’d like to be able to put a reference in my thesis)

About the “number of texture reads per second”: what value does this number take, exactly?

I mean, I’ve run MisterAnderson42’s bandwidthTest and I’ve always gotten greater bandwidth with texture memory (I have a GTX 275).

What do you mean by this premium? When reading float4 instead of 4 floats? For me that was quite different from 25%:

GTX 275, 128 threads per block, lots of blocks

copy_gmem<float> - Bandwidth: 100.523779 GiB/s
copy_gmem<float2> - Bandwidth: 101.778525 GiB/s
copy_gmem<float4> - Bandwidth: 93.921721 GiB/s
copy_tex<float> - Bandwidth: 103.165003 GiB/s
copy_tex<float2> - Bandwidth: 106.296428 GiB/s
copy_tex<float4> - Bandwidth: 105.912863 GiB/s
write_only<float> - Bandwidth: 64.749816 GiB/s
write_only<float2> - Bandwidth: 65.091805 GiB/s
write_only<float4> - Bandwidth: 64.326884 GiB/s
read_only_gmem<float> - Bandwidth: 68.311705 GiB/s
read_only_gmem<float2> - Bandwidth: 92.205778 GiB/s
read_only_gmem<float4> - Bandwidth: 50.456122 GiB/s
read_only_tex<float> - Bandwidth: 68.385170 GiB/s
read_only_tex<float2> - Bandwidth: 93.662547 GiB/s
read_only_tex<float4> - Bandwidth: 86.357301 GiB/s

Yes, it is in the programming guide in Appendix A along with all of the other specifications.

If you look at the quote carefully, I said three floats (i.e. I have xyz data, and it is faster to read it packed into float4s with dummy entries in the w component than to perform 3 separate float reads). You won’t see a difference in my bandwidth test program because it only performs a single read and write per thread. The specific benchmark where I get better performance with float4 vs 3 float reads makes ~100 semi-random xyz reads per thread.

Regarding the “number of texture reads per second”: I don’t know for certain. This is something that someone that knows more about the GPU hardware suggested to me as a reason for a benchmark a long time ago. I don’t really recall which thread it was in.

Edit: OK, the wording of that last sentence doesn’t really make sense… I’m too tired to be posting to the forums.

What I tried to say is that a while back I posted some strange performance results from a small benchmark. After a long discussion, someone who knows a lot more about the GPU hardware than I do pointed out that the GPU can only serve so many texture fetches per second. That explained the results of the benchmark rather well. The thread is so old I have no idea where to find it now.