Deliberate race condition

I have a very simple task: I want to know whether a certain piece of logic was executed by at least one thread in a threadblock. The typical size of my threadblock is 256 threads. Most of the time all threads execute this part, but in some threadblocks no thread reaches it.

Because I only need to know that at least one thread executed the logic, I'm trying the following approach:

uint8_t shared_memory[10];
if (thread_id < 10) shared_memory[thread_id] = 0;

threadblock_sync();

for (int i = 0; i < 10; i++) {
   bool a = false;
   // logic that may (and most likely will) set a to true

   if (a) {
      shared_memory[i] = 255;
   }
}

threadblock_sync();

if (thread_id < 10) 
    global_memory[uniform_offset + thread_id] = shared_memory[thread_id];

So I'm basically introducing a deliberate race condition on shared_memory. It seems to work, but I want to confirm that it won't lead to anything unexpected. One thing to note: I don't mind if the memory gets corrupted and some items end up as 18 instead of 255 or whatever, I just don't want them to be 0.

I'd appreciate it if somebody could confirm this approach is legitimate and can be used in production.

Since you are using threadblock_sync() twice, which presumably does something like __syncthreads(), there is no issue between the writes and the final read.

The only remaining race condition is that several threads may write to the same location, possibly at the same time.

If I remember the specific rules for shared memory correctly, you have the guarantee that you read a value written by one of the threads. There is no guarantee about which one, but you will either read 0 (no thread has written) or 255 (at least one thread has written).

So that is a perfectly valid program.
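For concreteness, the pattern from the question could be written as actual CUDA roughly as follows. This is a sketch, not the asker's real kernel: `uniform_offset` and the per-iteration predicate are stand-ins for the original logic.

```cuda
#include <cstdint>

__global__
void flag_kernel(const int* input, uint8_t* global_memory, int uniform_offset){
    __shared__ uint8_t flags[10];
    if (threadIdx.x < 10) flags[threadIdx.x] = 0;
    __syncthreads();

    for (int i = 0; i < 10; i++) {
        // Stand-in for the real per-thread logic that decides whether to flag i
        bool a = (input[threadIdx.x + blockIdx.x * blockDim.x] == i);
        if (a) {
            flags[i] = 255;  // benign race: many threads may store the same value
        }
    }
    __syncthreads();

    if (threadIdx.x < 10)
        global_memory[uniform_offset + threadIdx.x] = flags[threadIdx.x];
}
```

Because every racing thread stores the same byte value, the undefined "which thread wins" only affects which store lands last, not the value that is read back after the barrier.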

For something more involved (e.g. counting) you can use atomicAdd, which is quite fast on shared memory in recent architectures.
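A minimal sketch of that counting variant, using one shared-memory counter per block (the kernel and the value being counted are illustrative, not from the original post):

```cuda
__global__
void count_kernel(const int* input, int* output){
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    // Each matching thread increments the block-local counter atomically
    if (input[threadIdx.x + blockIdx.x * blockDim.x] == 42)
        atomicAdd(&count, 1);

    __syncthreads();
    if (threadIdx.x == 0)
        output[blockIdx.x] = count;  // exact per-block count, one write per block
}
```

Unlike the benign-race flag, atomicAdd gives an exact count, at the cost of serializing the conflicting updates.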

Thank you for your reply! I originally came up with this as a performance trick (the alternative is atomicOr plus a warp reduction). Do you think it is viable, or maybe even worse due to heavy bank conflicts?
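For reference, one way the "warp reduction + atomic" alternative could look: each warp votes with __any_sync, and only one lane per warp issues an atomicOr. This is a sketch under assumed names; the predicate stands in for the real logic, and a full block size that is a multiple of 32 is assumed.

```cuda
__global__
void warp_vote_kernel(const int* input, unsigned* output){
    __shared__ unsigned flag;
    if (threadIdx.x == 0) flag = 0;
    __syncthreads();

    // Stand-in predicate for the real per-thread logic
    bool a = (input[threadIdx.x + blockIdx.x * blockDim.x] == 42);

    // Warp-wide OR of the predicate across all 32 lanes
    unsigned any = __any_sync(0xFFFFFFFFu, a);

    // Only lane 0 of each warp touches shared memory: one atomic per warp
    if ((threadIdx.x % 32) == 0 && any)
        atomicOr(&flag, 1u);

    __syncthreads();
    if (threadIdx.x == 0)
        output[blockIdx.x] = flag;
}
```

This reduces shared-memory traffic from up to 256 stores per block to at most 8 atomics (one per warp of a 256-thread block).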

According to the documentation, what you do is valid (only which thread performs the final write is undefined).

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x), and which thread performs the final write is undefined.

A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank). In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).

I think one can find similar guarantees and non-guarantees for threads of different warps.

I guess another simple solution could be using __syncthreads_or, although I have not used it before.

__global__
void kernel(const int* input, int* output, int N){
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int stride = blockDim.x * gridDim.x;

    // Each thread records whether it has seen the value 42 anywhere
    // in its grid-stride portion of the input
    int seen42 = 0;
    for(int i = tid; i < N; i += stride){
        if(input[i] == 42){
            seen42 = 1;
        }
    }
    // Block-wide OR with an implicit barrier: nonzero iff any thread
    // in the block set seen42
    int blockseen42 = __syncthreads_or(seen42);
    if(threadIdx.x == 0){
        output[blockIdx.x] = blockseen42;
    }
}