Deliberate race condition

I have a very simple task: I want to know whether a certain piece of logic was executed by at least one thread in a threadblock. The typical size of my threadblock is 256 threads. Most of the time all threads execute this part, but in some threadblocks no thread reaches it.

Because I only need to know that at least one thread executed the logic, I'm trying the following approach:

uint8_t shared_memory[10];
if (thread_id < 10) shared_memory[thread_id] = 0;

threadblock_sync();

for (int i = 0; i < 10; i++) {
   bool a = false;
   // logic that may (and most likely will) set a to true

   if (a) {
      shared_memory[i] = 255;
   }
}

threadblock_sync();

if (thread_id < 10) 
    global_memory[uniform_offset + thread_id] = shared_memory[thread_id];

So I'm basically introducing a deliberate race condition on shared_memory. It seems to work, but I want to confirm that it won't lead to anything unexpected. One thing to note: I don't mind if the memory gets corrupted and some items end up as 18 instead of 255 or whatever, I just don't want them to be 0.

I'd appreciate it if somebody could confirm this approach is legitimate and can be used in production.

Since you are using threadblock_sync() twice, which presumably does something like __syncthreads(), there is no issue between the writes and the final read.

The only remaining race condition is that several threads may write to the same location, possibly at the same time.

If I remember the specific rules for shared memory correctly, you have the guarantee that you read a value written by one of the threads. There is no guarantee about which one, but you will either read 0 (no thread has written) or 255 (at least one thread has written).

So that is a perfectly valid program.
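For concreteness, the pattern from the question could be written as actual CUDA roughly as follows. This is a sketch, not the asker's real kernel: `uniform_offset` and the per-iteration predicate are stand-ins for the original logic.

```cuda
#include <cstdint>

__global__
void flag_kernel(const int* input, uint8_t* global_memory, int uniform_offset){
    __shared__ uint8_t flags[10];
    if (threadIdx.x < 10) flags[threadIdx.x] = 0;
    __syncthreads();

    for (int i = 0; i < 10; i++) {
        // Stand-in for the real per-thread logic that decides whether to flag i
        bool a = (input[threadIdx.x + blockIdx.x * blockDim.x] == i);
        if (a) {
            flags[i] = 255;  // benign race: many threads may store the same value
        }
    }
    __syncthreads();

    if (threadIdx.x < 10)
        global_memory[uniform_offset + threadIdx.x] = flags[threadIdx.x];
}
```

Because every racing thread stores the same byte value, the undefined "which thread wins" only affects which store lands last, not the value that is read back after the barrier.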

For something more involved (e.g. counting) you can use atomicAdd, which is quite fast on shared memory in recent architectures.
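A minimal sketch of that counting variant, using one shared-memory counter per block (the kernel and the value being counted are illustrative, not from the original post):

```cuda
__global__
void count_kernel(const int* input, int* output){
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    // Each matching thread increments the block-local counter atomically
    if (input[threadIdx.x + blockIdx.x * blockDim.x] == 42)
        atomicAdd(&count, 1);

    __syncthreads();
    if (threadIdx.x == 0)
        output[blockIdx.x] = count;  // exact per-block count, one write per block
}
```

Unlike the benign-race flag, atomicAdd gives an exact count, at the cost of serializing the conflicting updates.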

Thank you for your reply! I originally came up with this as a performance trick (the alternative is atomicOr plus a warp reduction). Do you think it is viable, or maybe even worse due to heavy bank conflicts?
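For reference, one way the "warp reduction + atomic" alternative could look: each warp votes with __any_sync, and only one lane per warp issues an atomicOr. This is a sketch under assumed names; the predicate stands in for the real logic, and a full block size that is a multiple of 32 is assumed.

```cuda
__global__
void warp_vote_kernel(const int* input, unsigned* output){
    __shared__ unsigned flag;
    if (threadIdx.x == 0) flag = 0;
    __syncthreads();

    // Stand-in predicate for the real per-thread logic
    bool a = (input[threadIdx.x + blockIdx.x * blockDim.x] == 42);

    // Warp-wide OR of the predicate across all 32 lanes
    unsigned any = __any_sync(0xFFFFFFFFu, a);

    // Only lane 0 of each warp touches shared memory: one atomic per warp
    if ((threadIdx.x % 32) == 0 && any)
        atomicOr(&flag, 1u);

    __syncthreads();
    if (threadIdx.x == 0)
        output[blockIdx.x] = flag;
}
```

This reduces shared-memory traffic from up to 256 stores per block to at most 8 atomics (one per warp of a 256-thread block).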

According to the documentation, what you do is valid (only which thread performs the final write is undefined).

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x), and which thread performs the final write is undefined.

A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank). In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).

I think one can find similar guarantees and non-guarantees for threads of different warps.

I guess another simple solution could be using __syncthreads_or, although I have not used it before.

__global__
void kernel(const int* input, int* output, int N){
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int stride = blockDim.x * gridDim.x;

    // Each thread records whether it has seen the value 42 anywhere
    // in its grid-stride portion of the input
    int seen42 = 0;
    for(int i = tid; i < N; i += stride){
        if(input[i] == 42){
            seen42 = 1;
        }
    }
    // Block-wide OR with an implicit barrier: nonzero iff any thread
    // in the block set seen42
    int blockseen42 = __syncthreads_or(seen42);
    if(threadIdx.x == 0){
        output[blockIdx.x] = blockseen42;
    }
}