Do __syncwarp() and __ballot_sync() protect global write scheduling?

I know that __syncthreads() is guaranteed to make sure that all global writes and atomic operations issued before the call are finished by the time the thread block is allowed to proceed. This will not protect against race conditions if other thread blocks are competing for the same memory, but if thread blocks are operating on exclusive regions of global memory then I’ve learned (and confirmed with much code and testing since) that __syncthreads() is something one can lean on.

Is the same true of __syncwarp() if warps are operating on their own exclusive sectors of global memory? If not, I can work around it and get back to __syncthreads(), but the code would be cleaner and probably a little faster if I didn’t have to.

Cheers!

See this section from the CUDA C++ Programming Guide:

Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.
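The store → __syncwarp() → load pattern described there can be sketched roughly as follows (the kernel name, buffer name, and indexing are illustrative, not from the guide; it assumes the whole warp reaches the barrier, i.e. the full mask):

```cuda
// Sketch: intra-warp communication through global memory, assuming each
// warp owns an exclusive, warp-sized slice of the buffer `data`.
__global__ void warpExchange(int *data)
{
    const int lane = threadIdx.x % warpSize;
    const int base = blockIdx.x * blockDim.x + threadIdx.x - lane; // warp's slice

    // Each lane stores into its own slot of the warp's slice.
    data[base + lane] = lane * lane;

    // Warp-wide barrier with memory ordering: stores above are visible
    // to all participating lanes before any of them proceeds.
    __syncwarp(); // default mask 0xffffffff = all 32 lanes participate

    // Now safe to read a value stored by another lane of the same warp.
    int neighbor = data[base + (lane + 1) % warpSize];
    data[base + lane] = neighbor;
}
```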

Then there is the warp vote functions section of the CUDA C++ Programming Guide, which states that for

int __all_sync(unsigned mask, int predicate);
int __any_sync(unsigned mask, int predicate);
unsigned __ballot_sync(unsigned mask, int predicate);
unsigned __activemask();

These intrinsics do not imply a memory barrier. They do not guarantee any memory ordering.
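So a vote converges the warp but does not publish data. Per the quoted passage, an explicit __syncwarp() is still needed before reading another lane's stores. A hedged sketch (names and indexing are illustrative):

```cuda
// Sketch: a ballot alone is NOT sufficient to order memory operations,
// assuming `buf` and `out` are warp-private, warp-sized slices.
__global__ void voteThenRead(int *buf, int *out)
{
    const int lane = threadIdx.x % warpSize;

    buf[lane] = lane + 1; // each lane writes its own slot

    // Converges the full warp and collects the predicate bits,
    // but guarantees NO memory ordering for the stores above.
    unsigned ready = __ballot_sync(0xffffffffu, buf[lane] > 0);

    __syncwarp(); // still required before cross-lane reads

    out[lane] = buf[(lane + 1) % warpSize] + (int)(ready != 0u);
}
```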


Hooray, thanks for the clear explanation (albeit a somewhat surprising one). I will need to refine my understanding if the warp vote functions (and, I have now checked, __shfl_sync()) do not imply a memory barrier. I had assumed that they all just had a __syncwarp() internally, but __syncwarp() is indeed a taller order, it seems!
