How to write global memory with last thread?

siri · February 16, 2020, 12:52pm

Hi, I have a kernel as follows:
__global__ void kernel(float *ga, float *in, float b, float *out, int num)
{
    float tmp = *ga;
    for (int id = threadIdx.x + blockIdx.x * blockDim.x; id < num; id += blockDim.x * gridDim.x) {
        out[id] = tmp * in[id] * b;
        // the thread that handles the last element writes back to ga
        if (id == num - 1) ga[0] = b * b;
    }
}

So can the last thread write back to ga without affecting “tmp” in the other threads?

Thanks in advance.

Robert_Crovella · February 16, 2020, 2:37pm

No, there is really no guarantee of that in CUDA. Threads can execute in any order. The last thread may execute first and write ga before the other threads have read it, which will affect tmp for all of those threads.

siri · February 16, 2020, 2:47pm

So sad… but thank you.

Robert_Crovella · February 17, 2020, 1:05am

You can use synchronization to create ordering. At the block level, you can use __syncthreads(), which only orders the threads within a single block. At the grid level, you may be able to use CUDA cooperative groups with a grid-wide sync. Finally, for algorithms where the last block/thread needs to do something special, you could also use an atomic/block-draining approach as shown in the CUDA threadFenceReduction sample code.
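As a rough illustration of the atomic/block-draining idea, here is a minimal sketch loosely modeled on the pattern in the threadFenceReduction sample. It is not the sample's code. The names scale_then_update, blocks_done, and run_demo, the launch configuration, and the test values are all assumptions made up for this example. Each block bumps a global counter once all of its threads have read ga, and only the block that sees the final ticket writes the new value.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical counter of blocks that have finished reading ga.
__device__ unsigned int blocks_done = 0;

// Hypothetical kernel name; same arguments as the kernel in the question.
__global__ void scale_then_update(float *ga, const float *in, float b,
                                  float *out, int num)
{
    float tmp = *ga;  // every thread reads the old value of ga first
    for (int id = threadIdx.x + blockIdx.x * blockDim.x; id < num;
         id += blockDim.x * gridDim.x) {
        out[id] = tmp * in[id] * b;
    }

    // Wait until all threads of this block are past their read of ga.
    __syncthreads();

    if (threadIdx.x == 0) {
        __threadfence();  // order this block's earlier accesses before the ticket
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        if (ticket == gridDim.x - 1) {
            // This is the last block to finish, so no block still needs the old ga.
            ga[0] = b * b;
            blocks_done = 0;  // reset the counter so the kernel can be launched again
        }
    }
}

// Runs the kernel on small made-up data and returns the number of mismatches.
// CUDA error checking is omitted to keep the sketch short.
int run_demo()
{
    const int num = 1000;
    const float b = 3.0f, ga_init = 2.0f;
    std::vector<float> h_in(num), h_out(num);
    for (int i = 0; i < num; ++i) h_in[i] = static_cast<float>(i);

    float *d_ga, *d_in, *d_out;
    cudaMalloc(&d_ga, sizeof(float));
    cudaMalloc(&d_in, num * sizeof(float));
    cudaMalloc(&d_out, num * sizeof(float));
    cudaMemcpy(d_ga, &ga_init, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in.data(), num * sizeof(float), cudaMemcpyHostToDevice);

    scale_then_update<<<4, 128>>>(d_ga, d_in, b, d_out, num);
    cudaDeviceSynchronize();

    float h_ga = 0.0f;
    cudaMemcpy(h_out.data(), d_out, num * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_ga, d_ga, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ga);
    cudaFree(d_in);
    cudaFree(d_out);

    int errors = (h_ga != b * b) ? 1 : 0;  // ga must hold the new value
    for (int i = 0; i < num; ++i) {
        // every output must have been computed with the old ga
        if (h_out[i] != ga_init * h_in[i] * b) ++errors;
    }
    return errors;
}

int main()
{
    int errors = run_demo();
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

The write to ga happens only after every block has taken a ticket, and a block takes its ticket only after all of its threads have passed the __syncthreads() that follows their read of ga. The cooperative groups grid-wide sync mentioned above is an alternative, but it needs a cooperative launch and a device that supports it.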