Hello,

I’d like to CUDAify the following serial code:

[codebox]
m = 0;
for (i = 0; i < n; i++)
{
    if (a[permarray[i]] != 0)
    {
        b[m] = i;
        m++;
    }
}
[/codebox]

The arrays a and permarray have n elements each; permarray holds a permutation of the unique numbers 0 to n-1.

As a result, array b should contain the m positions i for which a[permarray[i]] is nonzero; m is then the number of valid entries in b.

I’ve considered incrementing m with an atomic operation (which would make sure that m is computed correctly):

[codebox]
__shared__ int m;

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (idx==0) m=0;

__syncthreads();

if (idx < n) {
    if (a[permarray[idx]] != 0) {
        atomicAdd(&m,1);
        b[m]=idx;
    }
}

__syncthreads();
[/codebox]

… but this doesn’t prevent the threads from writing into the same positions of the b array.

The order of the positions in b doesn’t matter; all m of them just have to be present.
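Since the order is irrelevant, one direction that might work is to use the value returned by atomicAdd (the counter's value before the increment) as the write index, so that each thread reserves its own slot instead of re-reading m. The following is only a rough sketch under assumptions that are not part of my code above: the counter lives in global memory rather than shared memory so that all blocks share it, the arrays hold int, and the names compactNonzero and compactNonzeroHost are made up for illustration. Error checking is left out.

```cuda
#include <cuda_runtime.h>

// Each thread tests one position idx. A thread that finds a nonzero element
// reserves a unique slot in b via the return value of atomicAdd, which is
// the value of *m before the increment.
__global__ void compactNonzero(const int *a, const int *permarray,
                               int *b, int *m, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && a[permarray[idx]] != 0) {
        int slot = atomicAdd(m, 1);  // old counter value, unique per thread
        b[slot] = idx;               // order of entries in b is unspecified
    }
}

// Hypothetical host wrapper (illustrative name and signature).
// Copies the inputs to the device, runs the kernel, copies the first m
// entries of b back to the host, and returns m.
int compactNonzeroHost(const int *a, const int *permarray, int *b, int n)
{
    if (n <= 0) return 0;

    int *d_a, *d_perm, *d_b, *d_m;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_perm, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_m, sizeof(int));

    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_perm, permarray, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_m, 0, sizeof(int));  // counter starts at zero

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    compactNonzero<<<blocks, threads>>>(d_a, d_perm, d_b, d_m, n);

    int m = 0;
    cudaMemcpy(&m, d_m, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaMemcpy(b, d_b, m * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_perm);
    cudaFree(d_b);
    cudaFree(d_m);
    return m;
}
```

For a = {0, 5, 0, 7} and permarray = {2, 3, 0, 1}, the serial loop gives m = 2 and b = {1, 3}; the sketch should give the same two positions, possibly in the other order. I would expect all threads contending on a single counter to become a bottleneck when many elements are nonzero, so this is probably not the fastest approach.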

Any ideas?