Issues with atomicAdd when doing Inter block sync

The issue I am having is that I am attempting to do INTER-BLOCK SYNCHRONIZATION, meaning I would like a global barrier at which all threads in all blocks synchronize. For testing purposes the code below is run with only 1 thread per block. When the following code is run on a GTX 480 with CUDA 3.1, the code deadlocks and the kernel never finishes (WITHOUT THE MANUAL TIMEOUT). As seen below, I added a manual timeout to examine what was going on.

Here is the code I am working with:

[codebox]__device__ int g_barrier;

__device__ void g_sync(int blocks)
{
    int oldVal, timer = 0, tid = threadIdx.x;

    if (tid == 0)
    {
        oldVal = atomicAdd(&g_barrier, 1);
        printf(" %d", oldVal);
    }

    while (g_barrier != blocks)
    {
        timer += 1;
        if (timer == 1000000)   // manual timeout so the hang can be examined
            break;
    }
}

__global__ void simple_kernel(int blocks)
{
    int i;
    int ub = 4556;

    for (i = 0; i < ub; i++)
    {
        // EMPTY LOOP
    }

    g_barrier = 0;

    g_sync(blocks);

    if (blockIdx.x == 0)
    {
        printf("\n g_barrier: %d", g_barrier);
        printf("\nend");
    }
}

int main(int argc, char **argv)
{
    int blocks;

    printf("ENTER NUMBER OF BLOCKS: ");
    scanf("%d", &blocks);

    simple_kernel<<<blocks, 1>>>(blocks);
    cudaThreadExit();

    printf("\n");
    return 0;
}[/codebox]

The output of this code is as follows:

0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59

(output is sorted to show the issue)

More than one block returns the same "old value" from the atomicAdd, which should not happen.

Please test with 100 blocks and 1 thread per block. This is actually a simple kernel I have created to isolate the problem I am having with my CUDA program. Even though in this test kernel I can only reproduce the problem with 100 blocks, my CUDA program exhibits similar behavior with varying numbers of blocks.

I compile with the following options:

nvcc -g -G -arch sm_20 -Xptxas -dlcm=cg

Also, I run on a Fermi card that DOES NOT have X running on it. The machine runs openSUSE 11.1 x86_64.

The -dlcm=cg flag was meant to ensure that g_barrier was not cached, but it didn't make a difference.

Could someone reproduce this problem?

Also, could anyone shed any light on what I am missing?

I thought I should add that it is the addition of the for loop that causes the code to break.
This simple test kernel has an empty loop to demonstrate the issue, but my real program has a for loop with code in it.


The atomicAdd() is working fine. You get the same values multiple times because you reset g_barrier at the beginning of each block. Set it to zero only once, before launching the kernel.

However, as tmurray will tell you soon anyway, DO NOT attempt to invent a global barrier. This is impossible to get right, since not all blocks are running at the same time. If you need a global barrier, launch a new kernel instead.
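To make the suggestion concrete, here is a minimal sketch of the two-kernel approach, using hypothetical phase1/phase2 kernels in place of the real work. The kernel boundary itself acts as the global barrier: a second launch on the same stream does not begin until every block of the first has finished.

```cuda
#include <cuda_runtime.h>

// Hypothetical split of the work into two phases; the implicit
// synchronization between back-to-back kernel launches replaces g_sync().
__global__ void phase1(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;                 // work before the "barrier"
}

__global__ void phase2(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1;                // work after the "barrier"
}

int main()
{
    const int blocks = 100;
    int *d_data;
    cudaMalloc(&d_data, blocks * sizeof(int));

    // Back-to-back launches on the default stream: phase2 cannot
    // start until every block of phase1 has completed.
    phase1<<<blocks, 1>>>(d_data);
    phase2<<<blocks, 1>>>(d_data);
    cudaThreadSynchronize();     // CUDA 3.x-era host-side sync

    cudaFree(d_data);
    return 0;
}
```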


Ahhh right… that was stupid; I just wanted to initialize the barrier variable. It works fine now. I changed it to use cudaMemset(&g_barrier, 0, sizeof(int)).
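For anyone following along: since g_barrier is a __device__ symbol, taking its address directly on the host is not reliable; the address normally has to be resolved first via cudaGetSymbolAddress, or the value copied in with cudaMemcpyToSymbol. A minimal sketch of both host-side options:

```cuda
#include <cuda_runtime.h>

__device__ int g_barrier;

int main()
{
    // Option 1: copy a host-side zero into the device symbol.
    int zero = 0;
    cudaMemcpyToSymbol(g_barrier, &zero, sizeof(int));

    // Option 2: resolve the symbol's device address, then memset it.
    void *d_barrier;
    cudaGetSymbolAddress(&d_barrier, g_barrier);
    cudaMemset(d_barrier, 0, sizeof(int));

    // ... launch the kernel here, with the "g_barrier = 0;" line
    // removed from the kernel body so the reset happens only once ...
    return 0;
}
```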

As for inter-block sync, Fermi schedules up to 8 blocks per SM. So as long as you stick to 120 blocks as a maximum (8 times 15 SMs), it is actually possible to get a global barrier to work.
I am sure launching a new kernel is a much better implementation, but in case someone absolutely cannot do that, this is an alternative.

I can refer you to this paper I found for more information/data:

Inter-Block GPU Communication via Fast Barrier Synchronization.
Shucai Xiao, Wu-chun Feng.
In Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Atlanta, Georgia, USA, April 2010.


Don’t do this. It’s not supported. It will break. You will not get the perf you want. You will randomly find problems and have no idea what’s going on and that’s because it’s not supported.


To anyone reading this: better listen to the experienced programmers. I am going to scrap this inter-block sync code and modify my program to launch two kernels.


Yeah, I know how to do this on current hardware with current drivers, even with an unknown number of SMs and blocks per SM. However, if you look at the PTX documentation, Nvidia reserves the right to suspend blocks and even reschedule them to different SMs. I don’t know why they would want to do this, but it means that with any new version of the driver, or possibly even with the current driver under situations not encountered before, your code might suddenly break.
