CUDA Memory Consistency

Hi,

Congratulations to NVIDIA on the official release of the promising CUDA; the forums are of course a welcome addition.

I’ve a question about CUDA that’s been on my mind for a while. What is the CUDA memory consistency model? Do we have, say, sequential consistency between threads with regard to their view of global memory?

Regards,
Paul

Thanks, and good question.

There is no ordering guarantee if two threads within a warp write to the same address (in global or shared memory) at the same time.

The only synchronization primitive supported currently is “__syncthreads”, which waits until all threads in a block have reached that point. Typically this is used to ensure that all writes to shared memory have finished before continuing execution.
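For instance, a minimal sketch of that pattern (a hypothetical kernel, just to illustrate the idea, assuming a one-dimensional block of 128 threads):

__global__ void reverse( float *out, const float *in )
{
    __shared__ float buf[ 128 ];               // one copy per thread block
    unsigned int tid = threadIdx.x;
    buf[ tid ] = in[ tid ];                    // each thread writes one element of shared memory
    __syncthreads();                           // wait until every thread's write has finished
    out[ tid ] = buf[ blockDim.x - 1 - tid ];  // now safe to read an element written by another thread
}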

It’s a simple programming model, but there’s a lot you can do with it.


Thanks Simon. Perhaps I can provide a little more context to my question. If I have a thread which updates two global variables, ‘a’ and ‘flag’, in that order, i.e.:

a = …
flag = 1;

am I guaranteed that ‘a’ and ‘flag’ will be updated in that order? Would another thread (possibly in another block) be able to rely on code such as:

while (flag == 0);

b = a;

Or might ‘flag’ possibly be updated before ‘a’?

Thanks again,
Paul

It’s unclear how “a” and “flag” are being defined. Are they __shared__? Are they __device__? Or are they automatic variables?

If they are automatic variables then each thread gets its own copy and they are stored in registers. Thus there are no concurrency issues.

If they are __shared__ or __device__ variables, then you have no guarantees of ordering or atomicity when multiple threads write to a variable simultaneously. The result is undefined.
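For illustration, the three kinds of declaration look roughly like this (hypothetical names, not from your code):

__device__ float a;                  // __device__: a single copy in global memory, visible to all threads

__global__ void kernel( float *out )
{
    __shared__ float s;              // __shared__: one copy per thread block
    float r = (float)threadIdx.x;    // automatic: private to each thread, normally held in a register
    if ( threadIdx.x == 0 )
        s = a;                       // only one thread writes it, so there is no race here
    __syncthreads();
    out[ threadIdx.x ] = s + r;
}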

Also, it’s impossible to synchronize between thread blocks, so your while loop wouldn’t work in general either within a block or between blocks.

Mark

Yes, I’d declare both global variables, ‘a’ and ‘flag’, with the __device__ qualifier.

I understand the problems in updating a shared variable from multiple threads, but hopefully my example doesn’t present a race condition. Of the two threads, the first writes to ‘a’ and ‘flag’, while the second waits for ‘flag’ to change to nonzero. When ‘flag’ does change, the second thread updates ‘a’. Is this example, with pseudocode as before, safe?

One reason this would not be safe is if the memory model is not sequentially consistent. For example, the CUDA compiler might reorder the first thread’s update of ‘flag’ to occur before the update of ‘a’. This would certainly break the example, as the second thread could then potentially update ‘a’ without reflecting the first thread’s update.

Unfortunately, I don’t yet have the hardware, and I can’t rely on the emulator in this instance. Otherwise I’d try it, although experimentation can only go so far…

Many thanks,
Paul

No, as I wrote before, there are no guarantees of ordering. Any number of things in the program before you hit this point may have changed the order of thread execution.

For example, if one group of threads has to wait longer than another for a memory access, then the ordering may change.

Note, though, this has nothing to do with the compiler – the hardware schedules the threads while they are running.

Mark

Many thanks once again. I hope you can forgive my persistence, but I feel I have still failed to express my difficulty.

I am not referring to the schedule, or order of thread execution. There are two initial points I am assuming, so perhaps one of these is wrong:

  1. A single thread can write to a global device integer.
  2. Any other single thread can read the same variable; and act according to its value.

Is this correct? From there I then ask something slightly different: if a single thread specifies that two integer __device__ variables are to be updated, e.g. ‘a’ and ‘b’:


a = 27;
b = 39;

will the two writes to memory complete in the same order (‘a’ then ‘b’), as observed by another thread?

Cheers,
Paul

Mark, he’s not asking about thread order, but operation order in the same thread. Assume there is only a single active thread. If there are two instructions in that thread, a = 1 and flag = 1, in that order in the CUDA code, which DRAM address changes first, and is that always the case? (Would the CUDA->ptx compiler reorder? Would the ptx assembler reorder? Would the hardware not maintain sequential properties?)

Assuming that is true, the actual use of a synchronisation operation such as that discussed should be fine. I would warn, of course, that performance would tend to be terrible as you’d be inefficiently DRAM bound, with no caching to help you in between. If you’ve ever developed for multi processor/node configurations then I’m sure you’re aware of that kind of issue.

Also, because all processors are running the same code (but not, at the block level, necessarily at the same point in the code), you might find that some fancy conditionals and very careful use of syncs are needed.

Placing a __syncthreads between the two writes would probably be enough to guarantee order, but would obviously lose you 10s of cycles in the process (it’s an interesting approach to reducing register usage as well when the performance hit is tiny).
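To be concrete, this is roughly the placement I mean (‘a’ and ‘flag’ as in the earlier posts; note the barrier has to sit outside the conditional so that every thread in the block reaches it):

if ( threadIdx.x == 0 )
    a = 27;
__syncthreads();    // every thread in the block waits here, after the first write
if ( threadIdx.x == 0 )
    flag = 1;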

Lee

Ah, I see. I think we can safely assume that if it would cause incorrect results, the compiler will not re-order the execution of the instructions. If it does, it’s a bug. Instruction order that doesn’t affect correctness may be changed by the compiler for optimization purposes – just like any compiler.

As an aside, I’ll also say that I think it’s dangerous to think about execution order “within a single thread”. The GPU is a data-parallel processor, and CUDA is a data-parallel language. The only time you should think about a single thread is when

a ) The thread block size is 1 (This should be rare because it would be a horribly inefficient use of the hardware).

b ) You have isolated a single thread using code like this (assuming a 1D thread block):

if (threadIdx.x == 0) { ... }

This latter method is useful for updating shared variables, which might be what pkeir originally meant.
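For example, a hypothetical sketch of that pattern:

__global__ void kernel( int *out )
{
    __shared__ int base;
    if ( threadIdx.x == 0 )
        base = 100;                  // a single thread updates the shared variable
    __syncthreads();                 // the rest of the block waits for that write
    out[ blockIdx.x * blockDim.x + threadIdx.x ] = base + threadIdx.x;
}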

At all other times, the execution order of instructions is the same for all active threads in a block (where the set of active threads can be affected by branching), so it’s best to think in parallel.

Mark

I understood it as your case b. I can see cases where you might want a single thread to update a variable for the entire block (easy: sync, write, sync) or for the entire execution (using the locking approach pkeir’s suggesting here). In the latter case it won’t be efficient, but for a long enough execution it might be efficient enough, depending on the circumstances.

Thanks Lee, Mark. It is case b that I mean. I know it’s not efficient, but it is also only for rare occasions. (Incidentally, as you may have assumed, no thread other than the two being considered alters ‘flag’ in the example.)

...

// the following example excerpt is compiled as a CUDA kernel

if (tid == 136)
{
    a = 7;
    flag = 1;
}

if (tid == 203)
{
    while(flag == 0);    // Do nothing
    a++;
}

...

A sequential compiler might exchange the a = 7 and flag = 1 instructions above with each other for various reasons. I think Mark is suggesting the CUDA compiler will see the dependency, and would not do such an exchange. Great. I need to be certain of a result like that.

Cheers,

Paul

I make no promises. I’m not a compiler expert – I’m not sure what the CUDA compiler will do there. Have you tried it?

Mark

I’m glad you asked - yes, I’ve just coded up that simple example. I’ve left things much as they are in template.cu or such like. I’m also no compiler expert.

I’m using floats, and I check for zero as ‘var < 0.5f’; a bit paranoid maybe, but there you go. ‘a’ is now g_data[ 1 ] and ‘flag’ is g_data[ 0 ]. The program uses printf to display the result, which should be 8 (7+1). I decided to use two blocks.

On the emulator, it works under both EmuRelease and EmuDebug. I wonder how much of the compilation path the emulator shares with the device compiler though. Perhaps someone would be kind enough to let me know if it does run OK on the device.

Ciao,

Paul

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>

#define NTHREADS 128

__global__ void testKernel( float *g_data );

#define SDATA( index )  CUT_BANK_CHECKER( sdata, index )

int main( int argc, char **argv )
{
    CUT_CHECK_DEVICE();

    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer ) );
    CUT_SAFE_CALL( cutStartTimer( timer ) );

    unsigned int mem_size = sizeof( float ) * NTHREADS;

    float *h_idata = (float *)malloc( mem_size );    // allocate host memory
    for( unsigned int i = 0; i < NTHREADS; ++i )     // initialise the memory
    {
        h_idata[ i ] = 0.0f;
        printf( "%1.1f ", h_idata[ i ] );
    }

    float *d_data;
    CUDA_SAFE_CALL( cudaMalloc( (void **)&d_data, mem_size ) );    // allocate device memory

    // copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_data, h_idata, mem_size, cudaMemcpyHostToDevice ) );

    // setup execution parameters
    dim3 grid( 2, 1, 1 );
    dim3 threads( NTHREADS, 1, 1 );

    testKernel<<< grid, threads, mem_size >>>( d_data );    // execute the kernel
    CUT_CHECK_ERROR( "Kernel execution failed" );           // check if kernel execution generated an error

    float *h_odata = (float *)malloc( mem_size );    // allocate mem for the result on host side

    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_data, mem_size, cudaMemcpyDeviceToHost ) );

    printf( "\n\na is %1.1f flag is %1.1f (a should be 8.0) \n\n", h_odata[ 1 ], h_odata[ 0 ] );

    //for( unsigned int i = 0; i < NTHREADS; ++i )
    //    printf( "%1.1f ", h_odata[ i ] ); printf( "\n" );

    CUT_SAFE_CALL( cutStopTimer( timer ) );
    printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer ) );
    CUT_SAFE_CALL( cutDeleteTimer( timer ) );

    free( h_idata );
    free( h_odata );
    CUDA_SAFE_CALL( cudaFree( d_data ) );

    return 0;
}

__global__ void testKernel( float *g_data )    // pars are __global__
{
    extern __shared__ float sdata[];    // shared memory; the size is determined by the host application

    const unsigned int tid = threadIdx.x;    // access thread id
    const unsigned int bid = blockIdx.x;     // access block id
    //const unsigned int num_grids = gridDim.x;      // dim3 grid( 2, 1, 1 );
    //const unsigned int num_threads = blockDim.x;   // access number of threads in this block

    if ( tid == 0 && bid == 0 )
    {
        g_data[ 1 ] = 7.0f;    // a = 7
        g_data[ 0 ] = 1.0f;    // flag = 1
    }
    else if ( tid == 0 && bid == 1 )
    {
        while( g_data[ 0 ] < 0.5f );    // while ( flag == 0 );
        g_data[ 1 ]++;                  // a++;
    }
}

Not a great deal, I’d think. Try building with nvcc --ptx and look at the PTX assembly you get out from that. It’s the intermediate assembly language, so it’s just possible that the assembler will change the ordering again (no idea if it does that sort of thing), but at least you’ll see that the compiler hasn’t.

Unfortunately, of course, trying random examples is no guarantee that it’ll always work, and in your example there is no real dependency as far as I can see; the compiler would have to assume that the order you did things in is the order you want things done. Maybe you could try putting complicated address calculations into both of those assignment statements; that kind of thing might persuade the compiler to change the statement order, if it thinks there’s no dependency, to reduce the register requirement.

I tried it in hardware. With the comparison as you have it, it hangs. Changing it to

while( g_data[ 0 ] > 0.5f );    // while ( flag == 0 );

it works fine (so it’s not clear it’s ever seeing a 0 at all). Also, running it a second time doesn’t work at all, because nothing ever initialises the memory to 0. Actually, I’m a bit confused, but I only had a very quick glance while I work on some other stuff here :)

Thanks for that. The zeroing of ‘flag’ is done by the host (cudaMemcpyHostToDevice). If the kernel was to be run more than once, I’d perhaps reset the flag (g_data[ 0 ] = 0.0f) after incrementing ‘a’.

But it’s very interesting that you say it hangs. Even on the first run? Can you see why changing the comparison helped things?

Assuming the code does actually work on the device, let’s pick up your suggestion: I’m not yet sure, though, what I’d change to encourage the compiler to make a dependency error. Do you think the compiler could still have got lucky with such a simple example?

Then again, I do understand that such codes will sadly never give a general answer.

Paul

OK, I was being stupid and misunderstanding what you are trying to do. You cannot use global memory to communicate between threads. Since global stores are “fire and forget”, you cannot ensure ordering of global writes with __syncthreads(). To do this, your “flag” must be in shared memory.

Another way of saying this is: you cannot communicate or synchronize between thread blocks, only within thread blocks.
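For example, a hypothetical within-block version of the original ‘a’/‘flag’ idea, with ‘a’ in shared memory (here the barrier takes the place of the flag entirely):

__global__ void kernel( float *g_out )
{
    __shared__ float a;
    if ( threadIdx.x == 0 )
        a = 7.0f;                  // "a = 7", done by one thread of the block
    __syncthreads();               // barrier: the write is now visible to the whole block
    if ( threadIdx.x == 1 )
        g_out[ 0 ] = a + 1.0f;     // "a++", done by another thread of the same block
}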

Mark

Mark: That is not correct, I think… :) The following code will work just fine for synchronizing between blocks, assuming, of course, that all the necessary blocks are currently running on a multiprocessor, and assuming ‘a’ and ‘flag’ are in device memory. Also, ‘b’ is a temporary variable.

[BLOCK 1]

a = b;
while(a != b);
__syncthreads();
flag = 1;

[BLOCK 2]

while(flag != 1);
__syncthreads();
b = a;
// b now has the same value as in block 1!

This code should, as far as I can see, guarantee proper data sharing and synchronizing between simultaneously running blocks. This works because reads are not cached. I didn’t test it, but I can’t see why it wouldn’t work. I’m using __syncthreads to reduce the odds that a current or future compiler bug would move stuff around. In theory, it could be removed, afaics.

Of course, this is not very efficient, but I doubt that’s the point of the thread. Ideally, direct read/writes to L2 would be permitted, which would significantly reduce the performance hit of this because the latency is lower. I’m not sure it’s really worth the trouble for future CUDA revisions, since it has its own share of problems (variability of L2 size and number of multiprocessors between GPUs, for example…) but it could be interesting.

Sorry for bumping back this old thread too, I just thought this answer might help someone… :)

Arun, what if you have many blocks running the code you have in [Block 2], and anywhere from 1 to hundreds running the code in [Block 1]?

There is no guarantee on G80 that 16 copies of Block 2 won’t run first, thus hanging the GPU.

To reiterate: you should not try to synchronize using global memory. Sure, simple cases like this may work sometimes, but they are not guaranteed to work, and could break in the future even if they work now.

Mark

Yes, as I said above, you need to take into consideration which blocks are running at any given time. Given the number of multiprocessors, you have access to that information, of course. Thus, if you are targeting a specific chip, rather than CUDA in general, this scheme should be viable.

I can see why you say it’s not supported though, since if people used it without fully understanding its implications, you’d have a lot of unjustified bitching aimed your way ;)

Arun: Many thanks. Your response really helped me understand what may be wrong with my code from post #13: perhaps the variables by default only begin their flush from registers into global memory at a kernel’s close. So, when thread zero of block zero encounters the instruction “g_data[ 0 ] = 1.0f”, it merely updates a register, to be written out to global memory only when the kernel returns; a thing which never happens. This may be what Xrikcus was noticing in his post #14.

As your code stands, I think it is very similar to my own, with the distinction being that the thread which makes the first write is also the one doing the while test, hoping its change is now visible in global memory. I think that could fail just like my own, as both references, ‘a’ and ‘b’, could be served from a register. For the same reason, Block 2’s ‘flag’ variable might never be served with an updated value.

Perhaps, though, __syncthreads() has one other use for us besides restricting reordering: as a flush, triggering an update to global memory. How about putting one more __syncthreads() call before the “while(a != b)” line, a new one after “flag = 1”, and one more inside Block 2’s while loop? Might that trigger the necessary writes to global memory?

As I say I don’t have the hardware, so please forgive my slow progress. Thanks again for the help. Thanks to Mark too.

Paul