how to output result in various length?

I have faced a problem in CUDA programming when dealing with various length output

The question is:

I have a unsigned char sequence in share memory which only contains 0,1 like 00010000111000…,the length of this sequence is 1024*12, I want to find the position of 1, thus the output should be 4 9 10 11… .However I don’t know the total number of 1 in this sequence, thus various length output is needed

In cpu programming, The code could be

uchar *data; //data sequence
int output; //output array
for( int i=0, i<1024
12,i++)
{
if(data[i]==1)
{
*output=i;
output++;
}
}

In gpu,if 1024 threads are invoked

uchar *data; //data sequence in share memory
int idx=threadIdx.x;//thread index
int *output; //output array in global

for( int i=0, i<12 , i++)
{
if(data[1024*i+idx]==1)
{
output=1024i+idx; //Line 1
output++;//Line 2
}
__syncthreads();
}

Problem in Line 1 & Line 2: access confliction may happen if more than two threads find 1 in one loop, and the ++ operator is dangerous since it may affect the value of pointer “output” for other threads’ operation

Question:
If applied atomic exchange to replace Line 1 & 2 like

atomicExch(output++,1024*i+idx);

1.Can it guarantee that all these two operations, including exchange and ++, are performed in one atomic transaction? If not, how to revise this code?

2.Is there better way to write this code?

Note: random order in output sequence is accepted.

Thank you for you help! External Image

No! You could use cuobjdump to see the “assembly” code that’s happening underneath. All these operators split into totally different machine instructions and there is no guarantee of atomicity between them whatsoever! It’s the same thing on CPU too!

Depending on what you’re doing(for example how sparse the 1’s are) you might try some different things. Consider looking up some simple to difficult GPU algorithms like reduction/sort/merge/collision-detection.

For this problem, you might want to try several ‘bins’ each with an “end pointer”(some integer.) Then different blocks/threads can write into different bins atomically, using atomicAdd(bin_count, 1) (and it’s return value which is the old value.)

(sry for previous post)

You should be fine if you atomically increment the counter, something like this:

int index; // outside the kernel, initialize it to 0

...

// inside the kernel

output[atomicAdd(&index, 1)] = 1024*i+idx;

...

Edit: atomicInc actually suits your situation better.

Sergey.

Thanks for your help, your code works well in my program, thank you!

Thanks for your answer, and the #5 told me a possible way to realize the function i need and it works, thank you ~:)