Hi

We are trying to implement an algorithm on the GPU, but we are running into what looks like a race condition.

The kernel below performs one computation step and is called in a loop from the host (host code at the end).

The calculations performed here are:

```
r   = y - Ax + b/m * r
tau = alpha * 1/sqrt(m) * norm(r,2)
```

where r, y, and Ax are vectors and m, alpha, and b are scalars.

As an optimization, we perform the calculations only on certain indices of r, which are specified in the bool array d_mask.

With this implementation we get a race condition, but we do not know where it comes from.

On my laptop (HP 8440p with an NVIDIA NVS 3100M, compute capability 1.2) the cumulative sum (*d_tau_add_tmp) is not always calculated correctly. Most of the time it is correct, but roughly one run in ten the value is off by up to ±20%.

We tested the same code on a Lenovo W520 (NVIDIA M2000), and there the race condition does not occur.

We are aware that the code could be substantially improved in terms of performance, but we would like to understand the reason for the race condition first.

We are looking forward to your support!

Kernel:

```
__global__ void GPUampP1(const float* d_y, float* d_r, float* d_r2, float* d_Ax,
                         float* d_b, float* d_tau, const float alpha, const int n,
                         const bool* d_mask, const int m, float* d_tau_add_tmp)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // r = y - Ax + b/m*r
    if (idx < n)
    {
        if (d_mask[idx])
        {
            d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx];
        }
        else
        {
            d_r[idx] = 0;
        }
    }
    __syncthreads();

    // tau = alpha*1/sqrt(m)*norm(r,2);
    if (idx < n)
    {
        if (d_mask[idx])
            d_r2[idx] = d_r[idx] * d_r[idx];
    }
    __syncthreads();

    // Race condition seems to be in this part:
    if (idx < 1);
    {
        *d_tau_add_tmp = 0;
        for (int i = 0; i < n; i++)
            if (d_mask[i])
                *d_tau_add_tmp += d_r2[i]; // <-- the sum d_tau_add_tmp is not always correct (small deviation)
        float f = *d_tau_add_tmp;
        *d_tau = sqrt(f / m) * alpha;
    }
    __syncthreads();
}
```

Host:

```
for (int i = 0; i < len; i++)
{
    // some other kernel calls
    GPUampP1<<<blocksPerGrid, threadsPerBlock>>>(d_y, d_r, d_r2, d_Ax, d_b,
                                                 d_tau, alpha, n, d_mask, m,
                                                 d_tau_add_tmp);
    cudaDeviceSynchronize();
    getLastCudaError("kernel launch failure");
    // some other kernel calls
}
```