Issue using bar.cta.sync and arrive

Hi experts,

I have an issue using bar.cta.sync and bar.cta.arrive correctly for a producer and consumer model.

After reading the PTX ISA 8.2 documentation, I am trying to use bar.cta.sync and bar.cta.arrive to build a producer and consumer computation in shared memory. Below is example code intended to achieve this, but it doesn't work as expected. I am wondering where I went wrong and whether anyone could help me out.

The problem with the following code is that the consumer reads the wrong value from shared memory.

__device__ void membar_arrive_producer(int producer_size)
{
    // Arrive at barrier 0 without waiting. The "memory" clobber keeps the
    // compiler from reordering shared-memory accesses across the barrier.
    asm volatile("barrier.cta.arrive 0, %0;" :: "r"(producer_size) : "memory");
}

__device__ void membar_sync_consumer(int producer_size)
{
    // Wait at barrier 0 until its count reaches producer_size.
    asm volatile("barrier.cta.sync 0, %0;" :: "r"(producer_size) : "memory");
}


//single buffer
__global__ void membar_test(int *in, int *out)
{
    __shared__ int shared[64];
    if(threadIdx.x < 64) // producer
    {
        shared[threadIdx.x] = in[threadIdx.x];
        membar_arrive_producer(64);
    }
    else { // Consumer
        membar_sync_consumer(64);
        int item = shared[(threadIdx.x-64)%64];
        item = item+((threadIdx.x-64)/64);
        out[threadIdx.x-64] = item;
    }
}

If I change the code slightly, as follows, I expect the kernel to get stuck in a deadlock, but the kernel exits. Could anyone explain to me why?

// membar_arrive_producer and membar_sync_consumer unchanged from above


//single buffer
__global__ void membar_test(int *in, int *out)
{
    __shared__ int shared[64];
    if(threadIdx.x < 64) // producer
    {
        shared[threadIdx.x] = in[threadIdx.x];
    }
    else { // Consumer
        membar_sync_consumer(64);
        int item = shared[(threadIdx.x-64)%64];
        item = item+((threadIdx.x-64)/64);
        out[threadIdx.x-64] = item;
    }
}

Many thanks in advance!

Yuxin

One more try with this kernel:

__global__ void test() {
    if(threadIdx.x < 64)
    {
        asm("bar.sync 0;");
    }
}

Weirdly, the above code doesn't deadlock either, which doesn't make sense to me. Could anyone explain what is going on here?

It's difficult to achieve full clarity when you don't show something important, like how many threads you are launching. In the future, I suggest providing complete examples. I will assume it is 128 threads in the block.

In that case, try setting your barrier count target to 128 instead of 64:

    membar_arrive_producer(128);
    ...
    membar_sync_consumer(128);

Threads increment the barrier count when they reach the barrier. Threads executing the .arrive variant do not wait at the barrier; threads executing the .sync variant do wait. The waiting threads are released when the total number of threads executing either variant (for that barrier number) matches the barrier count target.
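
For reference, the relevant instruction forms in the PTX ISA are the following (the comments paraphrase the manual; a is a logical barrier number 0-15 and b is the count target, which must be a multiple of the warp size):

    bar{.cta}.arrive a, b;   // add the arriving threads to barrier a's count; do not wait
    bar{.cta}.sync   a{, b}; // add to the count and wait until it reaches b
                             // (if b is omitted, all threads in the CTA participate)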

If you are launching 128 threads per block, then the upper 2 warps “race ahead” past the if clause and begin arriving at membar_sync_consumer(), each thread incrementing the barrier count. Once those two warps (64 threads) have arrived, your original barrier count target of 64 is satisfied, and those threads proceed past the barrier. This is entirely independent of what the first 2 warps may be doing, which is presumably not your intent. By setting the barrier count target to 128, we require the first 64 threads to have reached membar_arrive_producer() (and all of the upper 64 threads to have reached membar_sync_consumer()) before any waiting thread is allowed to proceed past the membar_sync_consumer() point.

Regarding your second kernel, hopefully the behavior is now evident: your second set of 64 threads satisfies the barrier count target of 64 all by itself, so nothing ever waits on the producers.

Regarding your third kernel: exited threads contribute to the barrier count. bar.sync 0; with no count operand expects every thread in the CTA, and the threads that skip the barrier and exit still count toward it, so the waiting threads are released.

In my view it should not be necessary to use PTX to achieve a producer-consumer model. This can be done directly in CUDA C++ using libcu++ barriers or cooperative groups, and libcu++ provides other methods as well.
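
As an illustration (a minimal sketch, not code from this thread; the kernel name is made up and it assumes the 128-thread block used here), the same pattern with a libcu++ cuda::barrier might look like:

#include <cuda/barrier>

__global__ void membar_test_libcupp(int *in, int *out)
{
    __shared__ int shared[64];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0)
        init(&bar, blockDim.x);   // 128: all threads in the block participate
    __syncthreads();              // make the initialized barrier visible to all threads

    if (threadIdx.x < 64) {       // producer
        shared[threadIdx.x] = in[threadIdx.x];
        (void)bar.arrive();       // signal without waiting, like bar.arrive
    } else {                      // consumer
        bar.arrive_and_wait();    // wait until all 128 threads have arrived
        out[threadIdx.x - 64] = shared[threadIdx.x - 64];
    }
}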

Hi Robert,

Thanks for your reply. I launched 128 threads per block and only 1 block. Here is my main function:

#include <cstdio>
#include <cstdlib>
#include <iostream>

// Minimal error-checking macro for the CUDA runtime calls below.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            printf("CUDA error: %s\n", cudaGetErrorString(err_));     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {

    int * buffer_in;
    int * buffer_out;

    int buffer_in_h[64];
    int buffer_out_h[448];
    for(int i=0; i<64; i++)
    {
        buffer_in_h[i] = 1;
    }

    CUDA_CHECK(cudaMalloc(&buffer_in, sizeof(int)*64));
    CUDA_CHECK(cudaMalloc(&buffer_out, sizeof(int)*32*14)); // 448 ints
    CUDA_CHECK(cudaMemcpy(buffer_in, buffer_in_h, sizeof(int)*64, cudaMemcpyHostToDevice));

    membar_test<<<1, 128>>>(buffer_in, buffer_out);
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(buffer_out_h, buffer_out, sizeof(int)*448, cudaMemcpyDeviceToHost));

    for(int i=0; i<448; i=i+64)
    {
        std::cout << i << ": " << buffer_out_h[i] << std::endl;
    }

    return 0;
}

In my case, threads 0-63 are the producers, and threads 64-127 are the consumers.

So, bar.sync increments the barrier count for each thread that runs it, and bar.arrive likewise increments the barrier count for each thread that runs it. To make this producer and consumer pattern work, I should have:

__global__ void membar_test(int *in, int *out)
{
    __shared__ int shared[64];
    if(threadIdx.x < 64) // producer
    {
        shared[threadIdx.x] = in[threadIdx.x];
        membar_arrive_producer(128);
        //asm("bar.sync 0;");
    }
    else {
        membar_sync_consumer(128);
        int item = shared[(threadIdx.x-64)%64];
        assert(item == 1);
        item = item+((threadIdx.x-64)/64);
        out[threadIdx.x-64] = item;
    }
}

"The key distinction between arrive and sync lies in the behavior when the bar value reaches b. In the case of arrive, it resets the bar value to 0 when bar equals b.

However, consider a scenario where arrive increments the bar value before sync is executed. If arrive increases the ‘bar’ value but doesn’t encounter a situation where bar equals b, it won’t have the opportunity to reset the bar value back to 0. Am I right here? If I am correct, then what should I do to fix that?

I have no idea where you got that. The key distinction in behavior is that threads executing .arrive do not wait at the barrier, while threads executing .sync do wait for “completion”, to use the word in the PTX manual.

It should be self-evident from the modification I suggested that you do not have enough threads executing the arrive variant, by themselves, to meet the barrier count target. Yet I maintain it is valid and correct code, and empirically it also appears to work.

I don’t know that I understand the remainder, and might need an example.

A barrier instruction is explicitly not expected to be executed more than once by any thread in the process of completing a particular numbered barrier. Therefore, positing a case where not enough threads execute a barrier op to meet that barrier's count target is simply broken code. I cannot tell you generically how to fix it, except to say that you need to refactor your code so that the number of threads specified in the barrier count target can actually execute a barrier op for that numbered barrier.
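
To illustrate one possible refactoring (a hypothetical sketch, not code from this thread: the kernel name, helper names, the four barrier numbers, and the chunked in/out layout are all assumptions), here is a double-buffered producer/consumer pipeline in which every use of every barrier is matched by exactly 128 threads, 64 arriving and 64 syncing:

// Helpers taking the barrier number as a parameter (barriers 0..15 exist per CTA).
__device__ void bar_arrive(int name, int count)
{
    asm volatile("bar.arrive %0, %1;" :: "r"(name), "r"(count) : "memory");
}

__device__ void bar_sync(int name, int count)
{
    asm volatile("bar.sync %0, %1;" :: "r"(name), "r"(count) : "memory");
}

// Barriers 0/1 signal "buffer 0/1 is full"; barriers 2/3 signal "buffer 0/1
// has been consumed". Launch with 128 threads: warps 0-1 produce, 2-3 consume.
__global__ void pipeline(const int *in, int *out, int num_chunks)
{
    __shared__ int buf[2][64];
    const bool producer = threadIdx.x < 64;
    const int lane = producer ? threadIdx.x : threadIdx.x - 64;

    for (int i = 0; i < num_chunks; ++i) {
        int p = i & 1;                              // buffer used by this chunk
        if (producer) {
            if (i >= 2) bar_sync(2 + p, 128);       // wait: buffer p consumed
            buf[p][lane] = in[i * 64 + lane];
            bar_arrive(p, 128);                     // signal: buffer p full
        } else {
            bar_sync(p, 128);                       // wait: buffer p full
            out[i * 64 + lane] = buf[p][lane];
            if (i + 2 < num_chunks)
                bar_arrive(2 + p, 128);             // signal: buffer p consumed
        }
    }
}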

Hi Robert,

Thanks a lot for your clarification. I think I did make a mistake in saying: “The key distinction between arrive and sync lies in the behavior when the bar value reaches b. In the case of arrive, it resets the bar value to 0 when bar equals b.”

Both arrive and sync reset the barrier count when it reaches b, I guess?

Many, many thanks for explaining how sync and arrive work! It has been really helpful!

Again, thank you for your help!

Yuxin

I would say that when the barrier has been completed, it can be reused. That is almost a direct quote from the PTX manual:

When a barrier completes, the waiting threads are restarted without delay, and the barrier is reinitialized so that it can be immediately reused.

But I’m not going to argue with your phrasing. I think it captures the right idea.
