Local variables and thread/warp usage help with figuring out the problem

This code is meant to be executed by every thread of the grid. I wrote it only to check whether the elements of the original matrix get loaded into the right places in shared memory. The test fails. After examining the print statements, I found that only the first BLOCK_SIZE+2 elements of the grid get processed (bx=0, by=0, tx goes from 0 to the block size, ty=0), plus the first 2 elements of the next row.

It seems that as soon as the boundary flag is set to 1, processing stops. I thought boundary would be local to each thread, but either it isn't or something else is wrong with this code. Here it is:

kernel code:

    ...

    extern __shared__ float AS[];

    unsigned int boundary = 0;

    int k2=....
    int k=...

    if (f(tx,ty)==...) boundary = 1;

    if (!boundary) {
        AS[k2] = A[k];
        // four format specifiers need four arguments: both indices and both values
        printf("A[%d]=%g, loaded AS[%d]=%g\n", k, A[k], k2, AS[k2]);
    }

Output

Index k2=0=(0,0,0,0), boundary flag = 1

Index k2=1=(0,0,1,0), boundary flag = 1

Index k2=2=(0,0,2,0), boundary flag = 1

Index k2=3=(0,0,3,0), boundary flag = 1

Index k2=4=(0,0,4,0), boundary flag = 1

Index k2=5=(0,0,5,0), boundary flag = 1

Index k2=6=(0,0,6,0), boundary flag = 1

Index k2=7=(0,0,7,0), boundary flag = 1

Index k2=8=(0,0,8,0), boundary flag = 1

Index k2=9=(0,0,9,0), boundary flag = 1

Index k2=10=(0,0,10,0), boundary flag = 1

Index k2=11=(0,0,11,0), boundary flag = 1

Index k2=12=(0,0,12,0), boundary flag = 1

Index k2=13=(0,0,13,0), boundary flag = 1

Index k2=14=(0,0,14,0), boundary flag = 1

Index k2=15=(0,0,15,0), boundary flag = 1

Index k2=16=(0,0,16,0), boundary flag = 1

Index k2=17=(0,0,17,0), boundary flag = 1

Index k2=36=(0,0,0,1), boundary flag = 1

Index k2=37=(0,0,1,1), boundary flag = 0

What could be wrong here, and how would you recommend fixing it?

If you declare “boundary” outside of the kernel function it will be global; declared inside the kernel, as in your snippet, it is local to each thread.

It will be much easier to help you if you post the full source of your kernel, the variable declarations, and the kernel launch code.
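To illustrate the point about scope, here is a minimal sketch (not the original poster's kernel). The names `record_boundary` and `count_flag_mismatches` and the edge-thread test are made up for this example. Each thread writes its own copy of `boundary` to global memory so the host can check that setting the flag in one thread does not affect the others.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread records its private copy of `boundary`.
__global__ void record_boundary(unsigned int *flags)
{
    unsigned int boundary = 0;  // declared inside the kernel: one private copy per thread

    // Assumed boundary test for illustration: first and last thread of each block.
    if (threadIdx.x == 0 || threadIdx.x == blockDim.x - 1)
        boundary = 1;           // changes only the copy owned by this thread

    flags[blockIdx.x * blockDim.x + threadIdx.x] = boundary;
}

// Returns how many threads reported an unexpected flag, or -1 on a CUDA error.
int count_flag_mismatches(int blocks, int threads)
{
    const int n = blocks * threads;
    unsigned int *d_flags = nullptr;
    if (cudaMalloc(&d_flags, n * sizeof(unsigned int)) != cudaSuccess) return -1;

    record_boundary<<<blocks, threads>>>(d_flags);

    std::vector<unsigned int> h_flags(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_flags.data(), d_flags,
                                 n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_flags);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        int tx = i % threads;
        unsigned int expected = (tx == 0 || tx == threads - 1) ? 1u : 0u;
        if (h_flags[i] != expected) ++mismatches;
    }
    return mismatches;
}
```

If `boundary` were shared between threads, the interior threads would report 1 as well and the mismatch count would be nonzero. Calling `count_flag_mismatches(2, 18)` from `main` on a machine with a CUDA device is one way to try it.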

The problem was caused by a nested __syncthreads call. Thanks for the help!
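Since the full kernel was never posted, the following is only a guess at the general shape of the fix, not the actual code. The CUDA programming guide allows `__syncthreads()` in conditional code only if the condition evaluates identically for every thread of the block. A barrier placed inside a branch like `if (!boundary)` breaks that rule as soon as some threads are boundary threads. In this sketch the barrier is hoisted out of the branch so every thread reaches it. `TILE`, `tile_sum`, `count_tile_sum_mismatches`, and the 1D three-point sum are assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; each block runs TILE + 2 threads (one halo thread per side)

// Hypothetical kernel: load a tile plus halo into shared memory, then sum each element with its neighbours.
__global__ void tile_sum(const float *A, float *out, int n)
{
    extern __shared__ float AS[];                 // TILE + 2 floats, sized at launch
    int tx = threadIdx.x;                         // 0 .. TILE + 1
    int k  = (int)blockIdx.x * TILE + tx - 1;     // global index; may fall outside [0, n)

    unsigned int boundary = (k < 0 || k >= n) ? 1u : 0u;  // private to each thread

    if (!boundary)
        AS[tx] = A[k];
    else
        AS[tx] = 0.0f;                            // out-of-range slots contribute zero

    // The barrier sits outside the thread-dependent branch, so all threads of the block reach it.
    __syncthreads();

    if (tx >= 1 && tx <= TILE && !boundary)
        out[k] = AS[tx - 1] + AS[tx] + AS[tx + 1];
}

// Returns the number of wrong output elements, or -1 on a CUDA error.
int count_tile_sum_mismatches(int n)
{
    std::vector<float> hA(n), hOut(n, -1.0f);
    for (int i = 0; i < n; ++i) hA[i] = (float)i;

    float *dA = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dA, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) { cudaFree(dA); return -1; }
    cudaMemcpy(dA, hA.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    tile_sum<<<blocks, TILE + 2, (TILE + 2) * sizeof(float)>>>(dA, dOut, n);

    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float left  = (i > 0)     ? hA[i - 1] : 0.0f;
        float right = (i < n - 1) ? hA[i + 1] : 0.0f;
        if (hOut[i] != left + hA[i] + right) ++mismatches;
    }
    return mismatches;
}
```

Moving `__syncthreads()` back inside the `if (!boundary)` branch would give the kind of divergent barrier described above, where behaviour is undefined and a block may hang or stop partway through.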