The CUDA Programming Guide says that __syncthreads()
"waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block."
However, I found that in practice the kernel does not hang if some of the threads within a block return before reaching __syncthreads() (a common case when the last block is not fully utilized). I wonder why this happens. Should the documentation be modified, or is the behaviour of the kernel in this situation undefined?
__syncthreads() internally uses barrier synchronization; see PTX ISA 8.4.
The description there indicates that exited threads are just ignored:
"Operand b specifies the number of threads participating in the barrier. If no thread count is specified, all threads in the CTA participate in the barrier. When specifying a thread count, the value must be a multiple of the warp size."
"barrier{.cta} instruction causes executing thread to wait for all non-exited threads from its warp and marks warps’ arrival at barrier."
The PTX information provided is a sensible answer as to why the kernel does not hang.
However, the documentation for CUDA C++ should be adhered to: every thread in the block should reach __syncthreads(). Just because something appears to work does not mean that the code is correctly written.
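One way to follow the documented contract is to guard the work rather than return early, so that out-of-range threads in the last block still reach every __syncthreads(). The sketch below is only an illustration: the kernel blockSum, the host helper sumOnDevice, and the block size kBlockSize are hypothetical names, error checking is omitted, and the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size (must be a power of two here)

// Each block sums up to kBlockSize elements of `in` into out[blockIdx.x].
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float tile[kBlockSize];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads do NOT return; they contribute a neutral value
    // so that every thread in the block reaches the barriers below.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // outside the conditional: reached by all threads
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}

// Hypothetical host helper: launches blockSum and adds up the per-block sums.
// Assumes h_in is non-empty; CUDA error checking is omitted for brevity.
float sumOnDevice(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    int blocks = (n + kBlockSize - 1) / kBlockSize;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlockSize>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With 1000 input elements and 256 threads per block, the last block has 24 threads without data; they load 0.0f instead of exiting, so the barrier is never reached by only part of the block.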
Thank you! That makes a lot of sense. I guess I should turn to cuda::barrier for finer synchronization control.
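For reference, a minimal sketch of what that could look like with cuda::barrier from libcu++, where a thread that has nothing to do leaves the barrier explicitly with arrive_and_drop() before returning. The kernel addBlockFirst, the host helper runAddBlockFirst, and kThreads are hypothetical names chosen for illustration; the sketch assumes a toolkit that ships libcu++ and a GPU that supports cuda::barrier, and it omits error checking.

```cuda
#include <cuda/barrier>
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size

// Adds the first element of each block's slice to every element of that slice.
__global__ void addBlockFirst(float* data, int n)
{
    __shared__ float tile[kThreads];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    // One thread initializes the barrier for all threads in the block.
    // Every thread reaches this __syncthreads(), since nobody has exited yet.
    if (threadIdx.x == 0)
        init(&bar, blockDim.x);
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        // Leave the barrier explicitly so the remaining threads
        // are not left waiting for this one.
        bar.arrive_and_drop();
        return;
    }

    tile[threadIdx.x] = data[i];
    bar.arrive_and_wait();  // waits only for threads still participating

    data[i] = tile[threadIdx.x] + tile[0];
}

// Hypothetical host helper; assumes a non-empty input, no error checking.
std::vector<float> runAddBlockFirst(std::vector<float> h)
{
    int n = static_cast<int>(h.size());
    int blocks = (n + kThreads - 1) / kThreads;

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    addBlockFirst<<<blocks, kThreads>>>(d, n);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

The difference from the early-return pattern in the question is that the set of participating threads is stated in the code itself, instead of relying on how __syncthreads() happens to treat exited threads.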
That is a good approach: rather than assuming how it works or compiles internally, you adhere to the C++ documentation, as Robert also recommends.