How to implement a lock?

I want to lock a variable (shared/global memory, int/float/double, etc.) so that other threads in the same block can’t access it until it’s unlocked.

Any idea? Thanks.

Sure, just use a __syncthreads() block to isolate the single thread you want to have control.

Note this exclusion is thread based, not variable based, so you just need to make sure only to modify those variable(s) inside the blocked region.

__syncthreads();   // very important!

if (0 == threadIdx.x) {
    important_shared_memory_variable += 2;
    other_delicate_value = 0;
    // etc, put whatever critical atomic changes you like here
}

__syncthreads();   // very important!

This kind of structure is used a lot, especially for initialization of block-wide values at the start of kernels.
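
To make that concrete, here is a minimal sketch of the initialization use in a complete kernel (the kernel name, scaleTable parameter, and blockScale variable are invented for illustration): one thread fills in a block-wide value, the barrier makes everyone else wait, then all threads read it. Note that the leading __syncthreads() from the snippet above is only needed when other threads might still be using the variable; at the very start of a kernel the trailing barrier alone is enough.

__global__ void scaleKernel(float *data, const float *scaleTable, int n)
{
    __shared__ float blockScale;              // block-wide value, set once per block

    if (0 == threadIdx.x) {
        blockScale = scaleTable[blockIdx.x];  // only one thread writes it
    }
    __syncthreads();                          // everyone waits until it is set

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= blockScale;                // now every thread can read it safely
    }
}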

CUDA does not provide a simple mechanism for a fully general mutex lock, in part because algorithms that rely on mutexes tend to underutilize the GPU hardware due to the warp-level nature of instruction scheduling.

Mutexes are often used to implement other kinds of synchronization primitives, so instead you have to consider what you want at a higher level. For example:

  • single thread critical section: This is the example SPWorley gave. You have a small section of code that should only be executed by one thread, and this section of code uses results from other threads in the block, or outputs results used by other threads. Then you use the __syncthreads() barrier around the if statement to avoid a race condition. This is good for small sections of “serial” code in a kernel. It doesn’t work so well when you can’t group your exclusive-access code into a critical section, or when you only want to block threads while they access a specific variable.

  • atomic updates of a variable: This only works for simple data types in CUDA, and you use the atomicAdd/Inc/etc() functions. (See the first sketch after this list.)

  • scatter/gather between threads in a block: A more general version of the single-thread case. Here you have to separate the part of your kernel that writes to shared memory from the part that reads it with a __syncthreads() barrier (or vice versa). (See the second sketch after this list.)
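
For the atomic-update case, a hedged sketch (the kernel and parameter names are made up): a global histogram where many threads may increment the same bin concurrently. atomicAdd() serializes the conflicting updates in hardware, so no explicit lock is needed.

__global__ void histogramKernel(const unsigned int *values, unsigned int *bins,
                                int n, unsigned int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[values[i] % numBins], 1u);  // safe even when bins collide
    }
}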
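
And for the scatter/gather case, a sketch assuming a simple 1-D averaging stencil (names again invented, and block-boundary neighbors are ignored to keep it short): each thread writes only its own shared-memory slot, a single __syncthreads() separates the write phase from the read phase, and after the barrier threads are free to read each other’s slots.

__global__ void neighborAverageKernel(const float *in, float *out, int n)
{
    extern __shared__ float tile[];               // one slot per thread, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // write phase: own slot only

    __syncthreads();                              // barrier between writes and reads

    float mid   = tile[threadIdx.x];
    float left  = (threadIdx.x > 0)              ? tile[threadIdx.x - 1] : mid;
    float right = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : mid;
    if (i < n) {
        out[i] = (left + mid + right) / 3.0f;     // read phase: neighbors' slots
    }
}

// launched as: neighborAverageKernel<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_in, d_out, n);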

Unfortunately, this doesn’t come close to covering everything you can do with a mutex, so you might need to go one level up and think about how to design away the need for a mutex entirely, or transform the mutex into a primitive that CUDA does provide. An example of this is switching from one thread per input data element to one thread per output data element. By giving each thread “ownership” of a different output data element, there is no possibility of conflicting writes.
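
As a hedged illustration of that last idea (all names invented): a simple 1-D blur written as a gather, one thread per output element. Each thread owns exactly one out[] slot and only reads in[], so no two threads ever write the same location and nothing needs a lock; the scatter version, where each input adds its contribution to several outputs, would need atomics or a mutex.

__global__ void blurGatherKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    int count = 0;
    for (int k = -1; k <= 1; ++k) {           // gather the neighborhood this thread owns
        int j = i + k;
        if (j >= 0 && j < n) {
            sum += in[j];
            ++count;
        }
    }
    out[i] = sum / count;                     // the only write: this thread's own element
}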