How to write global memory with last thread?

Hi I have a kernel as flowing:
global void(float ga, float in, float b, float *out, int num)
float tmp = *ga;
for(int id = threadIdx.x + blockIdx.x * blockDim.x; i < num; i += blockDim.x * gridDim.x) {
out[id] = tmp * in[id] * b;
if (id == num - 1) ga[0] = b * b;

so can last thread write back to ga, and not affect “tmp”?

Thanks in advance.

No, there is really no guarantee of that in CUDA. Threads can execute in any order. The last thread may execute first, which will affect tmp for all other threads.

so sad…, but thank you.

You can use synchronization to create ordering. At the block level, you can use __syncthreads(). At the grid level, you may be able to use CUDA cooperative groups with a grid-wide sync. Finally, for algorithms where the last block/thread needs to do something special, you could also use an atomic/block-draining approach as identified in the cuda threadFenceReduction sample code.