I have code in which threads from different blocks may sometimes write to the same memory location. To be on the safe side, I want to synchronize the threads across all blocks. When I do this by ending the kernel and launching a new one, the output is correct, but the relaunch adds a lot of overhead. I thought streams could be a solution. Given my application, though, they would help only if kernels belonging to different streams could perform different operations, i.e. each stream launching its own kernel. This doesn't seem to be possible (from what I see, the different streams all operate on the same global device_kernel). Is there a way to do this? Some sort of switch, or any other provision?
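For reference, a minimal sketch of the kernel-splitting approach described above, assuming the work divides into two phases. The names `phase1`, `phase2`, and `run_two_phase` and the data layout are hypothetical placeholders, not my actual code, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical phase 1: every thread writes one element.
__global__ void phase1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical phase 2: every thread reads an element that a
// different block may have written during phase 1.
__global__ void phase2(const int *data, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[n - 1 - i];
}

// Runs both phases and returns out[0] (illustrative helper).
int run_two_phase(int n) {
    int *data = nullptr, *out = nullptr;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Kernels issued to the same stream execute in order, so the
    // boundary between the two launches acts as a grid-wide barrier:
    // all writes from phase1 are complete before phase2 starts.
    phase1<<<blocks, threads>>>(data, n);
    phase2<<<blocks, threads>>>(data, out, n);

    std::vector<int> host(n);
    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(data);
    cudaFree(out);
    return host[0];
}

int main() {
    printf("out[0] = %d\n", run_two_phase(1024));
    return 0;
}
```

The second launch is where the overhead comes from: each additional synchronization point costs one more kernel launch.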
Thanks & regards,