Greetings.

I have a kernel with a “while” loop, which iteratively updates elements of an array using information about neighbors

(only one neighbor in the sample code below). This loop stops when no element is changed at the current iteration.

Unfortunately, in some situations part of threads go out of this loop prematurely (like if they ignore synchronization barrier).

Some inputs are processed correctly every time, and other inputs (many of them) are processed incorrectly every time

(i.e. there are no stochastic factors). Strangely, this error occurs only in Release version while Debug version always

worked fine. More precisely, the CUDA compiler option “-G (Generate GPU Debug Information)” determines whether the

processing is correct. Arrays of size 32x32 or smaller are always processed correctly.

Here is a sample code:

```
__global__ void kernel(int *source, int size, unsigned char *result, unsigned char *alpha)
{
int x = threadIdx.x, y0 = threadIdx.y * 4;
int i, y;
__shared__ bool alpha_changed;
// Zero intermediate array using margins for safe access to neighbors
const int stride = MAX_SIZE + 2;
for (i = threadIdx.x + threadIdx.y * blockDim.x; i < stride * (stride + 3); i += blockDim.x * blockDim.y)
{
alpha[i] = 0;
}
__syncthreads();
for (int bit = MAX_BITS - 1; bit >= 0; bit--)
{
__syncthreads();
// Fill intermediate array with bit values from input array
alpha_changed = true;
alpha[(x + 1) + (y0 + 1) * stride] = (source[x + (y0 + 0) * size] & (1 << bit)) != 0;
alpha[(x + 1) + (y0 + 2) * stride] = (source[x + (y0 + 1) * size] & (1 << bit)) != 0;
alpha[(x + 1) + (y0 + 3) * stride] = (source[x + (y0 + 2) * size] & (1 << bit)) != 0;
alpha[(x + 1) + (y0 + 4) * stride] = (source[x + (y0 + 3) * size] & (1 << bit)) != 0;
__syncthreads();
// The loop in question
while (alpha_changed)
{
alpha_changed = false;
__syncthreads();
if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 1) * stride] = 1;
}
__syncthreads();
if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 2) * stride] = 1;
}
__syncthreads();
if (alpha[(x + 0) + (y0 + 3) * stride] != 0 && alpha[(x + 1) + (y0 + 3) * stride] == 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 3) * stride] = 1;
}
__syncthreads();
if (alpha[(x + 0) + (y0 + 4) * stride] != 0 && alpha[(x + 1) + (y0 + 4) * stride] == 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 4) * stride] = 1;
}
__syncthreads();
}
__syncthreads();
// Save result
result[x + (y0 + 0) * size + bit * size * size] = alpha[(x + 1) + (y0 + 1) * stride];
result[x + (y0 + 1) * size + bit * size * size] = alpha[(x + 1) + (y0 + 2) * stride];
result[x + (y0 + 2) * size + bit * size * size] = alpha[(x + 1) + (y0 + 3) * stride];
result[x + (y0 + 3) * size + bit * size * size] = alpha[(x + 1) + (y0 + 4) * stride];
__syncthreads();
}
}
// Run only 1 thread block, where size equals 64.
kernel <<< 1, dim3(size, size / 4) >>> (source_gpu, size, result_gpu, alpha_gpu);
```

The expected result of this sample kernel is array, where each line can contain only contiguous intervals

of “1” values. But instead of this, I get some lines, where “0” and “1” are somehow alternated.

This error is reproduced on my mobile GPU GeForce 740M (Kepler), on Windows 7 x64 SP1, on either CUDA 6.0 or 6.5,

using either Visual C++ 2012 or 2013. I can also provide a sample Visual Studio project with the sample input array (i.e. which is processed incorrectly).

I have already tried different configurations of syncthreads(), fences and “volatile” qualifier, but this error

remained.

Any help is appreciated.