I’m writing an iterative kernel that has a reduction as its last step.
I’m currently re-queuing the kernel from the host for every iteration, because otherwise the threads get out of sync: some go on to the next iteration while the reduction is still ongoing, and those free threads then behave incorrectly.
I’m thinking of cutting the thread count down so that everything runs on just one core; that way the threads stay synced and I can iterate on the GPU rather than the CPU.
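For illustration, here is a minimal sketch of that single-core idea, written in CUDA as one thread block. It is not my actual kernel: the names `iterate` and `runIterations`, the size `N`, and the "divide by the sum" step are all placeholder assumptions. It relies on `__syncthreads()`, which only synchronizes threads within one block, so the whole problem has to fit in that block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical problem size: one element per thread, power of two,
// small enough to fit in a single thread block.
constexpr int N = 256;

// Placeholder iterative kernel: every iteration ends with a sum reduction,
// and the next iteration uses the reduced value.
__global__ void iterate(float* data, int iterations)
{
    __shared__ float scratch[N];
    const int tid = threadIdx.x;

    for (int it = 0; it < iterations; ++it) {
        scratch[tid] = data[tid];
        __syncthreads();  // all inputs are in shared memory

        // Tree reduction in shared memory.
        for (int stride = N / 2; stride > 0; stride /= 2) {
            if (tid < stride) {
                scratch[tid] += scratch[tid + stride];
            }
            __syncthreads();  // finish this level before starting the next
        }

        const float sum = scratch[0];
        __syncthreads();  // everyone has read the sum before scratch is reused

        data[tid] /= sum;  // placeholder update that depends on the reduction
    }
}

// Host helper (hypothetical): runs the kernel once and returns the sum of the result.
float runIterations(int iterations)
{
    std::vector<float> host(N, 1.0f);
    float* dev = nullptr;
    cudaMalloc(&dev, N * sizeof(float));
    cudaMemcpy(dev, host.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    iterate<<<1, N>>>(dev, iterations);  // one block only, so barriers cover all threads

    // Blocking copy on the default stream; it waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    float total = 0.0f;
    for (float v : host) {
        total += v;
    }
    return total;
}
```

The host launches the kernel once, e.g. `runIterations(10)`, instead of re-queuing it per iteration. The cost is that only one block's worth of threads does any work, which is the trade-off I'm unsure about.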
Any advice? Does CUDA have this same problem?