Texture-cache miss... what happens to the warp?

Hi there,

I’m working on a particle simulation program where each thread calculates the interactions of one particle with a (pre-defined) set of neighbors. The particle positions (float4s) are bound to a texture. Due to the large number of threads and the small cache, it is very likely that the memory request of at least one thread in a (half-)warp will be a cache miss.
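For reference, my setup looks roughly like this (a minimal sketch, not my actual code; the kernel name and the flat neighbor-list layout are just for illustration):

texture<float4, 1, cudaReadModeElementType> posTex;  // particle positions

__global__ void interact(const int *neighbors, int maxNeighbors,
                         int numParticles, float4 *forces)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 myPos = tex1Dfetch(posTex, i);
    float4 f = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

    for (int n = 0; n < maxNeighbors; ++n) {
        int j = neighbors[i * maxNeighbors + n];  // pre-defined neighbor set
        float4 nPos = tex1Dfetch(posTex, j);      // this fetch may miss the cache
        // ... accumulate the pair interaction of myPos and nPos into f ...
    }
    forces[i] = f;
}

// host side, before the kernel launch:
// cudaBindTexture(0, posTex, d_pos, numParticles * sizeof(float4));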

My questions:
Let’s say one memory request to the texture cache is a miss: what happens to the rest of the warp? Does it wait until all positions are available? Does it diverge?

Has anybody found any documentation of how the texture cache works exactly? The size of the cache lines? What data is replaced, and how? Etc.

Thanks in advance.

Let us safely assume that a tex fetch() is actually a function that gets implemented as some N instructions.

The warp proceeds through these instructions one by one, all threads executing each instruction at the same time.

So, when one thread’s fetch in the warp results in a cache miss, all the other threads have to wait for that data to arrive. The warp cannot proceed to the next instruction while leaving a thread behind; if it did, it would no longer be a warp.

So, I would assume that the data for all threads of a warp needs to arrive before the warp can march ahead.

However, if NVIDIA’s hardware were smart enough to execute out of order, it might be possible for the warp to proceed to the next instruction without the data having come in from the texture cache, but that is immaterial to the programmer.

The outside view for the programmer is that all threads in the warp wait for all data to come from the texture before proceeding further.

The only sensible thing for it to do is to wait for the requested memory to be available and then continue. I can’t say that I’ve seen this behavior documented explicitly, though.

Certainly the NVIDIA people working on CUDA know this ;) Of course, they won’t tell any of us anything about it (trust me, I’ve asked them face to face).

I don’t see how knowing those details would make much of a difference, though. With many independent warps all executing concurrently on the same MP, you don’t have much control over how the memory is accessed through the cache as a whole. The best you can do is to have data-local accesses within each warp. If you do that, the device will reward you with the full device memory bandwidth available.

You mention that you have threads accessing neighbors of a particle. This is right up my alley, as the key routine in my CUDA app does exactly this. To improve the data locality, I apply a sort routine to the particles, rearranging them in memory so that the neighbors of a particle have nearby indices. The full details are in our paper: http://dx.doi.org/10.1016/j.jcp.2008.01.047 (or you can get the preprint from www.ameslab.gov/hoomd ). The sort itself is only a small portion of the paper, detailed in section 2.3, Particle Sort; a sketch of the idea follows below.
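The gist of the sort, as a host-side sketch (illustrative only; the names are made up and the real implementation is the one described in the paper):

#include <algorithm>
#include <vector>

struct Particle { float x, y, z; int tag; };

// Map a position in a cubic box of side L to a linear index on an
// ncell^3 grid of bins.
int cellIndex(const Particle &p, float L, int ncell)
{
    int cx = std::min((int)(p.x / L * ncell), ncell - 1);
    int cy = std::min((int)(p.y / L * ncell), ncell - 1);
    int cz = std::min((int)(p.z / L * ncell), ncell - 1);
    return cx + ncell * (cy + ncell * cz);
}

// Reorder the particles so that spatially close particles get nearby
// indices. After this, the neighbor indices a warp fetches cluster
// together, so the texture fetches land in nearby cache lines.
void sortByCell(std::vector<Particle> &particles, float L, int ncell)
{
    std::sort(particles.begin(), particles.end(),
              [=](const Particle &a, const Particle &b) {
                  return cellIndex(a, L, ncell) < cellIndex(b, L, ncell);
              });
}

Note that the neighbor list has to be remapped to the new indices after the sort, and since the particles move slowly you only need to re-sort occasionally.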

Thank you for your answers.

I was thinking about applying some data-sorting algorithm; that was the reason for my questions ;-)

From a probabilistic point of view: ~64 KB of texture cache holds the positions of about 4K particles (4K float4s * 16 bytes = 64 KB). I’m working with a GT200 -> 240 cores * 64 threads per block => I’m “working” on ~15K particles, a working set of roughly 240 KB.

=> the probability of having at least one cache miss per warp is very high…
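(To put a number on it: if each fetch hits the cache with probability p, a warp of 32 independent fetches proceeds without a stall only with probability p^32. Even at p = 0.9 that is 0.9^32 ≈ 0.03, so some thread misses in almost every warp.)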

Am I making an error in my reasoning?

@MisterAnderson42: I read the paper you mentioned. Have you ever tried the code on a GT200?

Thank you in advance.

That is why it is so important to get good data locality in the accesses made across the threads of a warp. Pretty much as soon as one warp is done with its read, another will be coming along to fill that spot in the cache.

Yep. HOOMD runs about 1.5x faster on a GT200 compared to a G80 (270 TPS vs. 186 TPS in the Lennard-Jones liquid benchmark). Note that the current release on the web has some issues that prevent it from running on the GT200. The development version is working fine, though, so it will be fixed in the next release.

If a thread in a warp has a cache miss, the whole warp must wait. (The threads in a warp are conjoined twins; even when a warp “diverges,” that’s just an illusion.) But if one warp has a cache miss and the others don’t, only that one warp stalls and the others pick up the slack.

EDIT: no, no, I have it all wrong. Accessing the texture cache has the same latency as accessing global memory itself. (This is by design. The tex cache is dead simple.) So there is no such thing as “waiting for a cache miss” :) You just end up using DRAM bandwidth, which causes a slowdown only if you use too much of it.

I personally think the main use of the texture cache is to get GTX 2xx-style coalescing behaviour; it is too small to get much of an advantage from actual caching. (If you have really local accesses, you would be using shared memory to implement your own caching anyway.)
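To illustrate what I mean by implementing your own caching, here is the standard shared-memory tiling pattern, sketched for an all-pairs interaction (a generic sketch, not from any particular codebase):

__global__ void forcesTiled(const float4 *pos, float4 *forces, int n)
{
    // launch with blockDim.x * sizeof(float4) bytes of dynamic shared memory
    extern __shared__ float4 tile[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 myPos = (i < n) ? pos[i] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    float4 f = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += blockDim.x) {
        // cooperatively stage one tile of positions in shared memory
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        __syncthreads();

        for (int k = 0; k < blockDim.x && base + k < n; ++k) {
            // ... accumulate the interaction of myPos with tile[k] into f ...
        }
        __syncthreads();
    }

    if (i < n) forces[i] = f;
}

This only pays off when the threads of a block actually share the data, of course; for scattered neighbor reads the texture cache does the job without any such bookkeeping.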

That’s a good way to think about it, except that the texture cache is much more efficient at handling random but semi-local accesses. I just benchmarked the kernel in my app (mentioned in an earlier post) that reads in the neighboring particles and calculates forces, on a GTX 280. With the 1D texture it completed in 1.88 ms; switching over to straight global memory reads, 4.03 ms. So there is basically a factor-of-two difference between them for my memory access pattern. (IIRC, it’s closer to a factor of 5-10 on compute 1.0 hardware, so the GTX 280 coalescing is helping out quite a bit here.)
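The switch between the two versions is essentially a one-line change in the inner loop (sketched here, reusing the illustrative names from the earlier sketch; the real kernel does more work per pair):

for (int n = 0; n < maxNeighbors; ++n) {
    int j = neighbors[i * maxNeighbors + n];
#ifdef USE_TEX
    float4 nPos = tex1Dfetch(posTex, j);  // through the tex cache: 1.88 ms
#else
    float4 nPos = pos[j];                 // straight global reads: 4.03 ms
#endif
    // ... force computation ...
}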

I was going to mention all the things I’ve discovered about the tex cache in microbenchmarks, but all the points have already been hit on here. I’ll just reiterate that what it all boils down to is this: getting the best performance out of the tex cache means making data-local accesses within each warp. That can be thought of as the GTX 2xx-style coalescing, as was pointed out, but better at handling more random accesses.