What barrier resource do __syncthreads() and the other CUDA runtime-level synchronization functions use? Is it fixed to barrier 0?
If bar.sync uses the same barrier resource as __syncthreads(), how would the two interact?
__syncthreads() used to translate to a bar.sync instruction on barrier #0. Whether that is still the case with CUDA 10.0 and compute capability 7.x I am not sure, but it would be a simple experiment to compile a kernel to PTX and check for yourself.
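As a rough sketch of that experiment: the file below contains one small kernel whose only synchronization is __syncthreads(), so any barrier instruction in the generated PTX must come from it. The file name sync_probe.cu, the kernel name reverse_block, and the helper run_reverse_check are made up for illustration, and the -arch value in the comment is only an example that should be adjusted to your GPU and toolkit.

```cuda
// sync_probe.cu (hypothetical file name)
//
// Emit PTX and look for the barrier instruction, for example:
//   nvcc -arch=sm_70 -ptx sync_probe.cu -o sync_probe.ptx
// then search sync_probe.ptx for "bar.sync" or "barrier.sync".
// To inspect the machine code (SASS) instead, build a cubin and dump it:
//   nvcc -arch=sm_70 -cubin sync_probe.cu -o sync_probe.cubin
//   cuobjdump -sass sync_probe.cubin
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block; illustrative value

// Reverses BLOCK ints through shared memory. The single __syncthreads()
// is the only block-level synchronization in this file.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();
    data[t] = tile[BLOCK - 1 - t];
}

// Runs the kernel once. Returns the number of wrong elements,
// or -1 if a CUDA call failed.
int run_reverse_check()
{
    int host[BLOCK];
    for (int i = 0; i < BLOCK; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    reverse_block<<<1, BLOCK>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < BLOCK; ++i) bad += (host[i] != BLOCK - 1 - i);
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_reverse_check());
    return 0;
}
```

The host code is only there so the file can also be built and run normally; for the PTX check, the kernel alone is what matters. Whatever barrier number shows up is what that particular toolkit and target produced, not a documented guarantee.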
Since there is no specification of this that I am aware of, I view it as an implementation detail, and therefore a hazard to depend on any particular behavior, from a code correctness point of view.
Any time you have to ask for unpublished information, or disassemble code to inspect compiler behavior, a flag should be raised in your thought process that indicates that what you are observing may not be dependable behavior for code correctness.
Oh, I’m such a noob. Of course, I should have tried to look at the assembly.
Do you have any tips for reading assembly?
Thanks, I will keep that in mind. A safer approach might be to use only bar.sync, and never __syncthreads(), in any kernel where I need bar.sync at least once.
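A minimal sketch of that idea, assuming inline PTX is acceptable in your code base: every barrier in the kernel goes through one wrapper that names the barrier ID explicitly, so nothing depends on which barrier __syncthreads() happens to use. The wrapper named_barrier_sync, the kernel reverse_halves, and the choice of IDs 1 and 2 are all hypothetical. Skipping ID 0 is only a precaution based on the observation above that __syncthreads() has historically used it. PTX requires the thread count passed to bar.sync to be a multiple of the warp size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define HALF 64  // threads per group; must be a multiple of the warp size (32)

// Hypothetical wrapper: synchronize `nthreads` threads on barrier `id`.
// PTX provides barrier IDs 0 through 15.
__device__ __forceinline__ void named_barrier_sync(int id, int nthreads)
{
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(nthreads) : "memory");
}

// A block of 2 * HALF threads is split into two groups of whole warps.
// Each group reverses its own half of the tile and waits only on its own
// barrier (ID 1 or 2), so neither __syncthreads() nor barrier 0 is used.
__global__ void reverse_halves(int *data)
{
    __shared__ int tile[2 * HALF];
    int t = threadIdx.x;
    int group = t / HALF;  // 0 or 1, uniform within each warp
    int lane = t % HALF;   // position inside the group
    tile[t] = data[t];
    named_barrier_sync(1 + group, HALF);
    data[t] = tile[group * HALF + (HALF - 1 - lane)];
}

// Runs the kernel once. Returns the number of wrong elements,
// or -1 if a CUDA call failed.
int run_halves_check()
{
    const int n = 2 * HALF;
    int host[2 * HALF];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    reverse_halves<<<1, n>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int group = i / HALF;
        int lane = i % HALF;
        bad += (host[i] != group * HALF + (HALF - 1 - lane));
    }
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_halves_check());
    return 0;
}
```

This only removes the dependence on __syncthreads() in your own source. Library code or compiler-generated code linked into the same kernel may still use barriers of its own, so the earlier caution about relying on unpublished behavior still applies.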