Local Thread Synchronization

markusxwr · June 25, 2024, 2:34pm

Hi everyone,
I am working on a CUDA project in which I want to implement local thread synchronization instead of using the common __syncthreads() function. The work is done by 128 threads, and each of them needs to perform several stages of operations.
Local synchronization is applicable because, in each stage, every 2 threads have data dependencies on their previous stage. So I only need those 2 threads to wait for each other, instead of making all threads wait. I created an array of type bool to store the status of each thread.
I have tested the function, and it works when the kernel is launched with 1 block of 128 threads. But if I launch more blocks, for example 256 blocks of 128 threads each, the results are wrong and unstable, which means the synchronization is not working correctly. Does anyone have an idea about the cause? Many thanks.

Curefab · June 25, 2024, 2:57pm

Why would you synchronize at finer than warp granularity? You can use the barrier instruction to synchronize only a few of the warps of a block.
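For illustration, here is a minimal sketch of what synchronizing only some warps of a block could look like, assuming "the barrier instruction" refers to the PTX named barrier `bar.sync`. The wrapper `namedBarrierSync`, the kernel `halfBlockExchange`, the group size of 64 threads, and the host helper are all hypothetical names and choices for this example, not something from the original post. Each half of a 128-thread block (two warps) waits on its own barrier ID, so the other half is never stalled.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads   = 128;  // threads per block, as in the question
constexpr int kGroupSize = 64;   // assumption: two warps share one barrier

// Thin wrapper around the PTX named barrier. A block has barrier IDs 0..15;
// ID 0 is the one used by __syncthreads(), so this sketch uses IDs 1 and 2.
// numThreads must be a multiple of the warp size.
__device__ __forceinline__ void namedBarrierSync(unsigned id, unsigned numThreads)
{
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(numThreads) : "memory");
}

__global__ void halfBlockExchange(int* out)
{
    __shared__ int buf[kThreads];
    const int t     = threadIdx.x;
    const int group = t / kGroupSize;      // 0: warps 0-1, 1: warps 2-3
    const int base  = group * kGroupSize;
    const int local = t - base;

    // Stage 1: every thread publishes one value.
    buf[t] = blockIdx.x * kThreads + t;

    // Only the 64 threads of this group wait for each other.
    namedBarrierSync(1 + group, kGroupSize);

    // Stage 2: read the mirrored slot of the same group, which was written
    // by a thread in the group's other warp.
    out[blockIdx.x * kThreads + t] = buf[base + (kGroupSize - 1 - local)];
}

// Launches the kernel and compares against a host-side reference.
bool runHalfBlockExchange(int numBlocks)
{
    const int n = numBlocks * kThreads;
    int* dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return false;

    halfBlockExchange<<<numBlocks, kThreads>>>(dOut);

    std::vector<int> h(n);
    bool ok = cudaMemcpy(h.data(), dOut, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);

    for (int i = 0; ok && i < n; ++i) {
        const int t    = i % kThreads;
        const int base = (t / kGroupSize) * kGroupSize;
        const int expected = (i - t) + base + (kGroupSize - 1 - (t - base));
        ok = (h[i] == expected);
    }
    return ok;
}

int main()
{
    const bool ok = runHalfBlockExchange(256);
    std::printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Because the barrier and the shared buffer are both per block, the same kernel behaves identically for 1 block or 256 blocks.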

markusxwr · June 25, 2024, 2:59pm

Hi, I read about this in a paper and want to apply it in my project.

Curefab · June 25, 2024, 3:24pm

You can synchronize the threads of a warp with __syncwarp(), and it makes no difference whether you want to synchronize two threads or 32. You can exchange the data with shuffle or shared memory.
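As a rough sketch of the shuffle variant, the kernel below pairs each thread with its neighbour (lane XOR 1) and passes the previous stage's value through `__shfl_xor_sync`, which synchronizes the lanes named in its mask as part of the exchange. The kernel name `pairStages`, the stage count, and the per-stage arithmetic are placeholders invented for this example. It assumes the block size is a multiple of 32 and that all lanes reach the shuffle, so the full mask is valid.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // threads per block, as in the question
constexpr int kStages  = 4;    // hypothetical number of stages

__global__ void pairStages(const int* in, int* out)
{
    const unsigned fullMask = 0xffffffffu;  // all 32 lanes take part
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int v = in[idx];

    for (int s = 0; s < kStages; ++s) {
        // Fetch the partner's value from the previous stage. The shuffle
        // waits for every lane in the mask, so no flag array is needed.
        const int partner = __shfl_xor_sync(fullMask, v, 1);
        v = v + partner + s;  // placeholder for the real per-stage work
    }
    out[idx] = v;
}

// Launches the kernel and compares against a host-side reference.
bool runPairStages(int numBlocks)
{
    const int n = numBlocks * kThreads;
    std::vector<int> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) hIn[i] = i % 1000;

    int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    pairStages<<<numBlocks, kThreads>>>(dIn, dOut);

    bool ok = cudaMemcpy(hOut.data(), dOut, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dIn);
    cudaFree(dOut);

    for (int i = 0; ok && i < n; ++i) {
        int a = hIn[i], b = hIn[i ^ 1];  // i ^ 1 is the partner's index
        for (int s = 0; s < kStages; ++s) {
            const int na = a + b + s;
            const int nb = b + a + s;
            a = na;
            b = nb;
        }
        ok = (hOut[i] == a);
    }
    return ok;
}

int main()
{
    const bool ok = runPairStages(256);
    std::printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

If the data goes through shared memory instead, the same loop would write the value, call __syncwarp(), and then read the partner's slot, with a second __syncwarp() before the next write.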

markusxwr · June 26, 2024, 11:08am

It makes sense. Many thanks:)
