Hi everyone,
I’ve got problems with the following code:
__global__ void myKernel(..............)
{
__shared__ unsigned int POINTS[26][26];
__shared__ int changed;
...............................
#pragma unroll 1 // no unroll
do
{
if (!(threadIdx.y) && !(threadIdx.x))
changed = 0;
__syncthreads();
if ((POINTS[letter1][LETTER] ^ POINTS[LETTER][letter1]))
{
POINTS[letter1][LETTER] = 1;
POINTS[letter2][pos2] = 1;
POINTS[LETTER][letter1] = 1;
POINTS[pos2][letter2] = 1;
changed = 1;
}
__syncthreads();
} while (changed);
...................................
}
The above code will works nicely if there is only one active thread block per multiprocessor. Otherwise, problems will appear: the loop may terminate earlier than expected or eventually hang the whole computer. I suspect the issue is due to one of the two reasons:
-
Out-of-bound writes to shared memory: Every multiprocessor has a single on-chip memory of 16KB allocated to active blocks as shared memory, thus out-of-bound writes from one thread block may interfere the other “concurrently” active thread blocks. I am not sure if the shared memory is protected by hardware or not (another topic has discussed this: http://forums.nvidia.com/index.php?showtopic=89299). I’ve reviewed my program carefully to look for possible out-of-bound writes but found nothing.
-
How the multiprocessor handles __syncthreads(): the above loop contains some __syncthreads() and conditional paths; I am not sure if they muddle the GPU hardware.
The program is run on a GLX260 card (SM version 1.3). The kernel is called with 26x26 blocks of Nx26 threads (N < 17). Currently I have to use a workaround occupying more than 8192 bytes of shared memory to prevent multiple active blocks per multiprocessor.
Anyone interested in the full source code may have a look on my whole project here: https://sourceforge.net/projects/fastbombesim/
I appreciate everyone’s help on this issue.
CVN