The issue i am having is that i am attempting to do INTER BLOCK SYNCHRONIZATION, meaning i would like a global barrier where all threads in all blocks synchronize at. For testing purposes the code below is run with only 1 thread per block . When the following code is run on a GTX 480 with cuda 3.1 , the code deadlocks and the kernel never finishes ( WITHOUT THE MANUAL TIMEOUT). As seen below i added a manual timeout to examine what was going on.
Here is the code i am working with
[codebox]
device int g_barrier;
device void g_sync(int blocks)
{
int oldVal,timer=0,tid = threadIdx.x;
if (tid==0)
{
oldVal=atomicAdd(&g_barrier,1);
printf(" %d",oldVal);
}
while(g_barrier!=blocks)
{
timer+=1;
if (timer==1000000)
break;
}
}
global void simple_kernel(int blocks)
{
int i;
int ub=4556;
for(i=0;i<ub;i++)
{
// EMPTY LOOP
}
g_barrier=0;
g_sync(blocks);
if(blockIdx.x==0)
{
printf("\n g_barrier: %d",g_barrier);
printf("\nend");
}
}
int main(int argc,char **argv)
{
int blocks;
printf("ENTER NUMBER OF BLOCKS: ");
scanf(“%d”,&blocks);
simple_kernel<<<blocks,1>>>(blocks);
cudaThreadExit();
printf(“\n”);
return 0;
}[/codebox]
The output of this code is as follows :
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
(output is sorted to show the issue)
More than one block returns the same “old value” when performing the atomicAdd which should not happen
please test with 100 blocks and 1 thread per block.This is actually a simple kernel i have created to isolate the problem i am having with my cuda program. Even though in this test kernel i can only reproduce the problem with 100 blocks , my cuda program exhibits similar behavior with varying number of blocks
I compile with the following options
nvcc -g -G -arch sm_20 -Xptxas -dlcm=cg
Also i run on a fermi card that DOES NOT have X running on it. The machine runs openSuSe 11.1 x84_64
the -dlcm=cg flag was to ensure that the g_barrier was not cached but it didnt make a difference
Could someone reproduce this problem ?
Also could anyone shed any light on what i am missing ??