This code should be executed by each thread of the grid. I wrote it just to check whether the elements of the original matrix get loaded into the right places in shared memory. The test fails. After examining the print statements, I found that only the first BLOCK_SIZE+2 elements of the grid get processed (bx=0, by=0, tx goes from 0 to the block size, ty=0), plus the first 2 elements of the next row.
It seems that as soon as the boundary flag is set to 1, processing stops. I thought that boundary would be local to each thread, but it seems that either it isn’t or something else is wrong with this code. Here it is:
Kernel code:
...
extern __shared__ float AS[];
unsigned int boundary=0;
int k2=....
int k=...
if (f(tx,ty)==...) boundary=1;
if (!boundary) {
    AS[k2]=A[k];
    // Four format specifiers need four arguments: both indices and both values.
    printf("A[%d]=%g, loaded AS[%d]=%g\n", k, A[k], k2, AS[k2]);
}
Output:
Index k2=0=(0,0,0,0), boundary flag = 1
Index k2=1=(0,0,1,0), boundary flag = 1
Index k2=2=(0,0,2,0), boundary flag = 1
Index k2=3=(0,0,3,0), boundary flag = 1
Index k2=4=(0,0,4,0), boundary flag = 1
Index k2=5=(0,0,5,0), boundary flag = 1
Index k2=6=(0,0,6,0), boundary flag = 1
Index k2=7=(0,0,7,0), boundary flag = 1
Index k2=8=(0,0,8,0), boundary flag = 1
Index k2=9=(0,0,9,0), boundary flag = 1
Index k2=10=(0,0,10,0), boundary flag = 1
Index k2=11=(0,0,11,0), boundary flag = 1
Index k2=12=(0,0,12,0), boundary flag = 1
Index k2=13=(0,0,13,0), boundary flag = 1
Index k2=14=(0,0,14,0), boundary flag = 1
Index k2=15=(0,0,15,0), boundary flag = 1
Index k2=16=(0,0,16,0), boundary flag = 1
Index k2=17=(0,0,17,0), boundary flag = 1
Index k2=36=(0,0,0,1), boundary flag = 1
Index k2=37=(0,0,1,1), boundary flag = 0
What could be wrong here, and how would you recommend fixing it?
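To check the assumption that boundary is local to each thread in isolation, a minimal standalone sketch along these lines might help. It is not the kernel from this post: the names flagKernel and countInterior are hypothetical, and the rule "tx == 0 or ty == 0" is only a stand-in for the elided f(tx,ty) test. Each thread writes its own flag to global memory, and the host counts how many threads kept the flag at 0, so no device-side printf is involved.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical test kernel: every thread sets its own flag and stores it,
// so the host can see whether one thread's flag affects another's.
__global__ void flagKernel(int *flags, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    unsigned int boundary = 0;   // automatic variable: one private copy per thread
    // Assumed boundary rule, for illustration only (stands in for f(tx,ty)).
    if (tx == 0 || ty == 0) boundary = 1;
    flags[ty * width + tx] = boundary;
}

// Launches a single n x n block (n <= 32) and returns the number of threads
// whose flag stayed 0, or -1 if a CUDA call failed.
int countInterior(int n)
{
    std::vector<int> host(n * n, -1);
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * n * sizeof(int)) != cudaSuccess) return -1;
    flagKernel<<<1, dim3(n, n)>>>(dev, n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev, n * n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;
    int interior = 0;
    for (int v : host) if (v == 0) ++interior;
    return interior;
}

int main()
{
    int n = 4;
    // If the flag is per-thread, all (n-1)*(n-1) interior threads keep it at 0.
    printf("interior threads: %d (expected %d)\n",
           countInterior(n), (n - 1) * (n - 1));
    return 0;
}
```

If the two numbers agree, the flag behaves as a per-thread variable under these assumptions, and the cause of the missing output would have to be looked for elsewhere in the kernel or in how the output is collected.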