using PTX barrier.sync

I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar, which details the PTX synchronization instructions.

  1. It says there are 16 “barrier logical resources”, and you can specify which barrier to use with the parameter “a”. What is a barrier logical resource?

  2. I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside “asm” or what “memory” does. I assume “name” replaces “%0” and “numThreads” replaces “%1”, but what is “memory” and what are the colons for?

__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
  1. In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with the bar.sync
    function? (For example, say I have a grid of 1 block, with 256 threads per block. We split the block into 3 conditional branches such that threads 0 ~ 63 go into kernel 1, threads 64 ~ 127 go into kernel 2, and threads 128 ~ 255 go into kernel 3. I want threads in kernel 2 to synchronize only among themselves. So if I use the “namedBarrierSync” function defined above, “namedBarrierSync(1, 64)”, does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?)

  2. I have tested with below code ( assume that gpuAssert is an error checking function defined somewhere in the file ):

__global__ void test(int num_threads) {
    if (threadIdx.x >= 64 && threadIdx.x < 128) {
        namedBarrierSync(1, num_threads);
    }
    __syncthreads();
}

int main(void) {
    test<<<1, 1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 1;
}

It returns without any error or hanging for both num_threads == 128 and num_threads == 64. This doesn’t make sense to me, since only 64 threads go into namedBarrierSync. So shouldn’t the program hang when num_threads == 128?

For item 2 (general understanding of the inline PTX syntax) you may wish to read the relevant doc:

https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
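In short, the statement follows GCC’s extended inline assembly form, which that document describes. Here is the same function again with comments mapping each piece to that form (just an annotated restatement of your code, nothing changed):

```cuda
// General form: asm volatile("template" : outputs : inputs : clobbers);
// - "bar.sync %0, %1;" is the PTX template; %0 and %1 are substituted
//   with the operands listed after the colons, in order.
// - The first colon-delimited section (empty here) would hold output operands.
// - The second section holds the inputs: "r"(name) binds the C++ variable
//   `name` to a 32-bit register for %0, and "r"(numThreads) does the same for %1.
// - "memory" is a clobber: it tells the compiler that the instruction may read
//   or write memory, so it must not cache values in registers across the asm
//   statement or reorder memory accesses around it.
__device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
```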

Note that your test is launching only a single thread, with 256 bytes of shared memory allocated to it. So any synchronization instruction inside the kernel just reduces to a no-op.

You probably want to launch your kernel as

test<<<1, 256>>>(128);

instead.
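For completeness, a minimal corrected version of the test (assuming the same namedBarrierSync and gpuAssert helpers from earlier in the thread) would look like:

```cuda
#include <cstdio>

__global__ void test(int num_threads) {
    // 256 threads per block; only threads 64..127 take this branch.
    if (threadIdx.x >= 64 && threadIdx.x < 128) {
        namedBarrierSync(1, num_threads);
    }
    __syncthreads();
}

int main(void) {
    // 256 threads per block, rather than 1 thread with 256 bytes of shared memory.
    test<<<1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE__);
    printf("complete\n");
    return 0;
}
```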

To Robert_Crovella,

Thanks! In fact, minutes after posting the question, I found the documentation you linked. I should edit the original question.

I do have a question regarding “memory” even after reading it, but I will post it on a different thread.

To tera,

Oh… I will test again and get back ASAP. Thanks a lot!

I have played with bar.sync, and here are the conclusions from my observation.

For item 3: it seems like the semantics of bar.sync are: synchronize num_threads threads, starting from the thread with the lowest thread index that calls the function. For example, in the code above, if threads 64 ~ 127 enter the function with “num_threads = 64”, synchronization happens for 64 threads starting from the thread with threadIdx.x == 64 (not the thread with threadIdx.x == 0). To put it another way, num_threads is the number of threads that synchronize relative to the thread that calls bar.sync, not an absolute thread index.

For item 1: I am still not sure about the 16 different barrier resources, but my hypothesis is: if thread 0 calls bar.sync with barrier 0 and num_threads 2, then thread 1 should also call bar.sync or bar.arrive with barrier 0, and not some other barrier.

Please correct me if I am wrong.

Rather than speculate, why not read the documentation?
It specifically says

I didn’t originally link to it because you already posted the link yourself in your cross-post on StackOverflow.

So to make it clear, there is no notion of “relative thread indices” in CUDA. The threads getting synchronized are those whose warps encounter the barrier instruction.

You are correct about the named barriers. All threads participating in a named barrier need to pass the same number between 0…15 as the first argument to the barrier instruction. If some threads passed a different argument, this would create a deadlock, as you would now have two groups of threads waiting in different barriers, each of them waiting for the other group to arrive.
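To illustrate the deadlock, here is a hypothetical kernel (my own sketch, assuming a launch with 128 threads per block) in which the two halves name different barriers:

```cuda
__global__ void mismatchedBarriers() {
    // DEADLOCK: the two halves wait in different named barriers, but each
    // barrier expects 128 threads to arrive, so neither group is ever released.
    if (threadIdx.x < 64) {
        asm volatile("bar.sync 0, 128;" ::: "memory");  // barrier 0
    } else {
        asm volatile("bar.sync 1, 128;" ::: "memory");  // barrier 1
    }
    // Correct: both branches would name the same barrier with the count of
    // threads actually arriving, e.g. "bar.sync 0, 128;" in both, or each
    // group could use its own barrier with a count of 64.
}
```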

I had read those sentences (multiple times, in fact), but it wasn’t clear to me what “all other warps participating in the barrier” meant (and I am still not completely sure). The second operand specifies the number of threads participating in the synchronization. So does that mean the instruction doesn’t care whether the warps are contiguous or not? It just waits for that many threads (any threads) from the block to reach the synchronization point? Or is it a specific set of consecutive warps?

A warp is participating in the barrier as soon as one of its threads hits the barrier instruction.

Or to describe it from a different angle, there is no way a thread could encounter a barrier instruction and decide “this one is not for me”.
A thread always participates in barriers it encounters, and the named barrier argument only determines which other threads it synchronises with (namely, the other threads encountering the same named barrier).

All the barrier instruction does is wait for a given number of warps (the thread count really is a number of warps times 32) to encounter the same named barrier. As soon as that number is reached, all threads are released to continue.

Oh I see. So barrier instruction only cares for the number of threads, and doesn’t care for the coordinate of threads or warps. Okay, it makes so much more sense.

If I launch a grid with 1 block of 128 threads, and num_threads == 64, the first thread that hits the barrier would wait for 63 other threads to hit it, and then they would all go on. The 65th thread to hit the barrier would wait for 63 other threads, then move on.

Would this be approximately correct?

Yes. Just be careful that synchronization is on a warp level, not individual threads.
So the first warp to hit a barrier with thread count N waits for (N/32)-1 other warps.
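To make the warp granularity concrete, here is a small hypothetical helper (my own addition, not from the discussion above) that rounds a requested thread count up to a multiple of the warp size, since bar.sync counts arrivals per warp:

```cuda
// bar.sync counts arrivals at warp granularity, so the thread-count operand
// should be a multiple of 32; this rounds a requested count up to the next
// warp boundary, e.g. 64 -> 64, 65 -> 96.
__device__ __forceinline__ int roundUpToWarpMultiple(int n) {
    return (n + 31) & ~31;
}
```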

Thanks a lot, tera! Saved my day :)

I think there might be a problem here with the use of bar.sync instead of barrier.sync.

From the documentation:
bar.sync is equivalent to barrier.sync.aligned.

Also:
Instruction barrier has optional .aligned modifier. When specified, it indicates that all threads in CTA will execute the same barrier instruction. In conditionally executed code, an aligned barrier instruction should only be used if it is known that all threads in CTA evaluate the condition identically, otherwise behavior is undefined.

This documentation seems to indicate that you should use “barrier.sync” instead of “bar.sync” in your example, because not all threads execute the barrier instruction.
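Assuming that reading of the docs, a sketch of the helper rewritten with the non-aligned opcode; only the instruction name changes, the operand and constraint syntax stays identical:

```cuda
// barrier.sync (without .aligned) is documented as safe in conditionally
// executed code where not all threads of the CTA reach the instruction,
// unlike bar.sync, which is equivalent to barrier.sync.aligned.
__device__ __forceinline__ void namedBarrierSyncNonAligned(int name, int numThreads) {
    asm volatile("barrier.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
}
```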