How to resync all threads after a reduction


I’m writing an iterative kernel that has a reduction as its last step.

At the moment I re-queue the kernel for every iteration. If I loop inside the kernel instead, the threads get out of sync: some go on to the next iteration while the reduction is still in progress, so the threads that finished early behave incorrectly.

I’m thinking of cutting the thread count down so that everything runs on a single core. That way the threads stay in sync and I can iterate on the GPU rather than on the CPU.

Any advice? Does CUDA have this same problem?


If you need to sync across the whole ND-range, your only choice is to sync at the CPU level (queue another kernel). If you only need to sync within a work-group, you can use a barrier instead.
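
For the work-group case, here is a rough sketch of what iterating inside the kernel could look like. It assumes the kernel is launched with a single work-group (global size equal to local size), that the local size is a power of two, and that `scratch` holds one float per work-item. The kernel name, the buffer names and the per-iteration update are placeholders I made up, not anything from your code.

```c
// OpenCL C kernel sketch: iterate on the device inside ONE work-group.
// Assumptions: global size == local size, local size is a power of two,
// scratch has get_local_size(0) floats. All names here are hypothetical.
__kernel void iterate_in_one_group(__global float *data,    // one value per work-item
                                   __global float *totals,  // totals[i] = sum after iteration i
                                   __local  float *scratch, // local buffer for the reduction
                                   const int num_iterations)
{
    const size_t lid   = get_local_id(0);
    const size_t lsize = get_local_size(0);

    for (int iter = 0; iter < num_iterations; ++iter) {
        // Placeholder for the real per-iteration work.
        float value = data[lid] * 0.5f;
        scratch[lid] = value;

        // Every work-item must have written scratch before anyone reads it.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction in local memory. The barrier sits outside the
        // `if` so that all work-items in the group reach it.
        for (size_t stride = lsize / 2; stride > 0; stride /= 2) {
            if (lid < stride) {
                scratch[lid] += scratch[lid + stride];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // Every work-item reads the reduced value for the next iteration.
        const float total = scratch[0];
        if (lid == 0) {
            totals[iter] = total;
        }

        // Resync: nobody may overwrite scratch for the next iteration
        // until every work-item has read scratch[0].
        barrier(CLK_LOCAL_MEM_FENCE);

        // Placeholder update that depends on the reduction result.
        data[lid] = value - total / (float)lsize;
    }
}
```

The barrier at the end of the loop body is the one that stops early finishers from running ahead into the next iteration. This only works because everything is in one work-group; as soon as you launch more than one group you are back to queuing a kernel per iteration.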