Memory programming model of Fermi

Something I have been wondering about for quite some time: Fermi is supposed to have a per-multiprocessor first-level read/write cache, and this cache is not coherent across the set of multiprocessors. So what happens when different multiprocessors write at the same time to different variables that reside in the same cache line? I assume cache lines are larger than 32 bits, so the problem should exist even for float variables, but for chars it should certainly arise.

__global__ void foo(char* data) {
    if (threadIdx.x == 0) {
        data[blockIdx.x] = 1;  // each block writes a single byte; neighbouring bytes share a cache line
    }
}

I couldn’t find anything in the programmers guide 3.0 about this. Is the code above supposed to work? If yes, how does the hardware handle it?
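For concreteness, here is a hedged sketch of the full experiment; the host-side harness, the value written, and the launch configuration are my own completion of the fragment above, not code from the original post. If per-byte writes from different SMs to the same cache line are handled correctly, no byte should be lost.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block writes one neighbouring byte; with 128-byte cache lines,
// many blocks on different SMs touch the same line.
__global__ void foo(char* data) {
    if (threadIdx.x == 0) {
        data[blockIdx.x] = 1;
    }
}

int main() {
    const int nBlocks = 128;
    char* d_data;
    cudaMalloc(&d_data, nBlocks);
    cudaMemset(d_data, 0, nBlocks);

    foo<<<nBlocks, 32>>>(d_data);
    cudaDeviceSynchronize();

    char h_data[nBlocks];
    cudaMemcpy(h_data, d_data, nBlocks, cudaMemcpyDeviceToHost);

    // If the hardware merges per-byte writes correctly, lost == 0.
    int lost = 0;
    for (int i = 0; i < nBlocks; ++i)
        if (h_data[i] != 1) ++lost;
    printf("lost writes: %d\n", lost);

    cudaFree(d_data);
    return 0;
}
```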

Although it isn’t clear to me what will happen, the answer is probably implicitly tucked away in here.

I would suspect that it works as intended; otherwise it would introduce significant problems for existing programs. In hardware, you could implement it with a write-through L1, where all writes update the L1 and pass through to the L2 at byte granularity rather than cache-line granularity. You could also add byte masks to the L1’s cache lines and retain a write-back scheme, at the cost of an additional 1/8 size overhead for the data segment of the L1.

I already read it. Didn’t find anything useful on this point.

Yes, as nothing is mentioned anywhere, in theory this means it is supposed to work. I hope someone knows something more definite, though.

I suspect the __threadfence_system() call forces all dirty cache lines to be written back to device memory. If incoherent cache lines collide because blocks wrote to the same addresses, then one of the blocks wins, but which one is undefined. (It’s your fault for making the colliding writes!) As Greg said, there may be some byte-level bitmasks allowing partial, per-byte writes of cache lines… that must be annoying for NV to implement, but it would mean that different blocks writing to different bytes would not collide even when those bytes lie on the same cache line.

This is just my theory, the docs aren’t detailed enough to tell yet. But how else could they do it?

That’s not what __threadfence_system() does at all!


Right. The documented behavior of __threadfence_system() is to have the caller wait until all memory accesses of that SM are visible to all threads on the device (and host for zero copy memory.)

That’s all that’s said about it in the Programming Guide beta.

So my guess, reading into that behavior, was that as a side effect this forces the dirty cache lines to be written back to the device… how else could the data in the calling SM’s L2 be visible to the other SMs after what __threadfence_system() promises?

Actually, plain __threadfence() should make a similar promise; __threadfence_system() just additionally extends it to zero-copy memory, which is probably incidental.
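Whatever the cache mechanics turn out to be, the documented way to rely on the fence is the producer/consumer pattern from the Programming Guide: write the payload, fence, then publish a flag through a volatile variable. A hedged sketch (variable names and values are mine; it assumes both blocks are resident on the device at the same time, or the consumer spins forever):

```cuda
#include <cstdio>

__device__ int payload;
__device__ volatile int flag = 0;

__global__ void producer_consumer() {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 42;        // 1. write the data
        __threadfence();     // 2. order the write before the flag, device-wide
        flag = 1;            // 3. publish
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (flag == 0) {} // spin until the producer publishes
        printf("payload = %d\n", payload);
    }
}

int main() {
    producer_consumer<<<2, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that __threadfence() only orders the producer’s writes as seen by other threads; it does not, by itself, document any flush or invalidation of particular caches, which is exactly the distinction being argued about above.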

L2 is common to all SMs… kind of like the L3 on Nehalem.

Gregory’s answer looks the most favorable.

The system will guarantee that “data[blockIdx.x]”, as written by the current block, will be visible to all threads running on that MP.
But if you want to snoop “data[blockIdx.x] + 1”, then it would probably force a cache-line flush (with byte enables) to L2 and then reload the line from L2.

I don’t think the memory consistency model requires any of the caches to be coherent except at __threadfence_system(), and that looseness is one of the most significant advantages of CUDA. In your example, the final results would contain the updated values from other blocks, but those values would not be visible immediately to threads in other blocks. If __threadfence_system() is not intended to be a common operation, they could just flush the L1s entirely.

Thanks for the clarification… it’s great to know about this. I have not read through the Fermi spec.

I can also see the advantage of the loose coherency, as you point out. Thanks!

I guessed the same earlier in this thread, but tmurray’s reply implies that it may not be that simple.

Well, I don’t think his comment was in regard to the memory consistency model (please correct me if I am wrong), but rather about the function of __threadfence_system(), for which I can’t seem to find any documentation.