Memory programming model of Fermi

Something I have been wondering about for quite some time: Fermi is supposed to have a per-multiprocessor first-level read/write cache, and this cache is not coherent across the set of multiprocessors. So what happens when different multiprocessors write at the same time to different variables that reside in the same cache line? I assume the cache lines are larger than 32 bits, so the problem should exist even for float variables, but for chars, for example, the problem should certainly arise.

__global__ void foo(char* data) {
    if (threadIdx.x == 0) {
        data[blockIdx.x]++;
    }
}

I couldn’t find anything in the Programming Guide 3.0 about this. Is the code above supposed to work? If so, how does the hardware handle it?
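For what it’s worth, here is the kind of minimal host-side test I would use to check whether the per-byte increments survive (the block count and launch configuration are arbitrary, and it only verifies the end result, not what the caches do in between):

#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as above: one thread per block increments "its" byte.
__global__ void foo(char* data) {
    if (threadIdx.x == 0) {
        data[blockIdx.x]++;
    }
}

int main() {
    const int numBlocks = 256;           // many consecutive bytes share a cache line
    char* d_data;
    cudaMalloc((void**)&d_data, numBlocks);
    cudaMemset(d_data, 0, numBlocks);

    foo<<<numBlocks, 32>>>(d_data);

    char h_data[numBlocks];
    cudaMemcpy(h_data, d_data, numBlocks, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < numBlocks; ++i)
        if (h_data[i] != 1) ++errors;    // every byte should have been bumped exactly once
    printf("%d bytes with unexpected values\n", errors);

    cudaFree(d_data);
    return 0;
}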

Although it isn’t clear to me what will happen, the answer is probably implicitly tucked away in here.

I would suspect that it would work as intended; otherwise it would introduce significant problems for existing programs. In hardware, you could implement it with a write-through L1 cache, where all writes would update (or bypass) the L1 and update the L2 at byte granularity rather than cache-line granularity. You could also add byte masks to the cache lines in the L1 and retain a write-back scheme, at the cost of roughly an additional 1/8th size overhead relative to the data segment of the L1.
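Purely as a software model of that second option (not a claim about what Fermi actually does), the byte-mask merge on write-back would look something like this:

// Software model of a byte-masked write-back: only bytes the SM actually
// wrote (dirty_mask bit set) overwrite the copy in L2; the rest keep the
// L2 contents, so different SMs touching different bytes of the same line
// don't clobber each other. A 128-byte line needs a 128-bit mask, i.e. the
// ~1/8 overhead mentioned above.
void writeback_line(const unsigned char* l1_line,
                    unsigned char*       l2_line,
                    const unsigned char* dirty_mask /* 16 bytes = 128 bits */) {
    for (int i = 0; i < 128; ++i) {
        if (dirty_mask[i / 8] & (1u << (i % 8))) {
            l2_line[i] = l1_line[i];
        }
    }
}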

I already read it. Didn’t find anything useful on this point.

Yes, as nothing is mentioned anywhere, in theory this means it is supposed to work. I hope someone knows something more definite, though.

I suspect the __threadfence_system() call forces a write of all dirty caches back to device memory. If the incoherent cache lines collide because of blocks writing to the same addresses, then one of the blocks wins, but which one is undefined. (It’s your fault for making the colliding writes!) As Greg said, there may be some byte-level bitmasks to allow partial writes of cachelines on a per-byte level… that must be annoying to implement for NV but it would mean that different blocks writing to different bytes would not collide even if the bytes are on the same cacheline.

This is just my theory; the docs aren’t detailed enough to tell yet. But how else could they do it?

That’s not what __threadfence_system() does at all!

:)

Right. The documented behavior of __threadfence_system() is to have the calling thread wait until all of its prior memory accesses are visible to all threads on the device (and to the host, for zero-copy memory).

That’s all that’s said about it in the Programming Guide beta.

So my guess, reading into that behavior, was that as a side effect this forces the dirty L2 cache lines to be written back to device memory… how else could the data in the calling SM’s L2 be visible to the other SMs, as __threadfence_system() promises?

Actually, plain __threadfence() should make a similar promise; __threadfence_system() just additionally extends it to zero-copy memory, which is probably incidental.
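For reference, the pattern the Programming Guide uses to motivate __threadfence() looks roughly like this (reconstructed from memory, with a placeholder per-block “result”): each block publishes a partial result, fences, and the last block to finish reads them all.

__device__ unsigned int count = 0;

__global__ void fence_example(float* partial, float* total) {
    __shared__ bool isLastBlock;

    if (threadIdx.x == 0) {
        // Placeholder per-block result; in the guide this is a partial sum.
        partial[blockIdx.x] = (float)blockIdx.x;

        // Make the write above visible to the whole device *before* this
        // block announces that it has finished.
        __threadfence();

        unsigned int ticket = atomicInc(&count, gridDim.x);
        isLastBlock = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the last block to reach the atomic reads everyone's partial
    // result: every other block fenced before counting itself as done.
    if (isLastBlock && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            sum += partial[i];
        *total = sum;
        count = 0;   // reset for the next launch
    }
}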

L2 is common to all SMs… kind of like the L3 on Nehalem.

Gregory’s answer looks the most plausible.

The system will guarantee that “data[blockIdx.x]” – the byte written by the current block – is visible to all threads running on that multiprocessor.
But if you want to snoop “data[blockIdx.x + 1]” – a byte written by another block – then it would probably force a cache-line flush (with byte enables) out to L2 and then re-load the line from L2.

I don’t think the memory consistency model requires any of the caches to be coherent except at __threadfence_system(), which is one of the most significant advantages of CUDA. In your example, the final results would include the updated values from the other blocks, but those updates would not be immediately visible to threads in other blocks. If __threadfence_system() is not intended to be a common operation, they could just flush the L1s entirely.
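To make that concrete with the original example, something like the following hypothetical kernel (names made up) is what I have in mind: writing your own byte is fine, reading a neighbour’s byte mid-kernel is not guaranteed to see anything in particular, but once the kernel has completed the host sees every increment.

// Each block bumps its own byte, then peeks at its neighbour's byte.
// The peek is not guaranteed to observe the neighbour's increment while the
// kernel is running (no coherence/ordering guarantee between blocks), but
// after kernel completion the host sees all numBlocks increments.
__global__ void bump_and_peek(char* data, char* peek, int numBlocks) {
    if (threadIdx.x == 0) {
        data[blockIdx.x]++;                          // own byte: well defined
        int neighbour = (blockIdx.x + 1) % numBlocks;
        peek[blockIdx.x] = data[neighbour];          // may be stale or updated: undefined
    }
}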

Thanks for the clarification… It’s great to know about it. I have not read through the Fermi spec.

I can also see the advantage of the loose coherency as you point out… Thanks.

I guessed the same earlier in this thread, but tmurray’s reply implies that it may not be that simple.

Well, I don’t think his comment was about the memory consistency model (please correct me if I am wrong), but rather about the function of __threadfence_system(), for which I can’t seem to find any documentation.