Hi everyone,
I am making a cuda project, where I want to implement a local thread synchronization instead of the commonly used function __syncthreads(). My task is done by 128 threads and each of them need to do several stages of operations.
Why is the local synchronization appliable is that in each stage, every 2 threads have data dependencies on their last stage. So I only need to let them 2 threads to wait, instead of letting all threads wait. I create an array of type bool to save the status of the thread.
I have tested the function, and it runs well when doing this with kernel function with 1 block of 128 threads. But if I deploy more blocks, for example 256 blocks with 128 threads each. The results comes wrong and instable, which means the synchronization has not run correctly. Does anyone have idea about the cause? Many thanks.
Why would you do a synchronization with less than warp granularity? You can use the barrier instruction to do synchronization with only a few of the warps of a block.
Hi, I have read this on a paper and want to deploy this to my project.
You can do thread synchronization for the threads of a warp with __syncwarp() and it will make no difference, whether it is two or 32 threads, you want to synchronize. You can exchange the data with shuffle or shared memory.
It makes sense. Many thanks:)