I have faced a problem in CUDA programming when dealing with various length output
The question is:
I have a unsigned char sequence in share memory which only contains 0,1 like 00010000111000…,the length of this sequence is 1024*12, I want to find the position of 1, thus the output should be 4 9 10 11… .However I don’t know the total number of 1 in this sequence, thus various length output is needed
In cpu programming, The code could be
uchar *data; //data sequence
int output; //output array
for( int i=0, i<102412,i++)
{
if(data[i]==1)
{
*output=i;
output++;
}
}
In gpu,if 1024 threads are invoked
uchar *data; //data sequence in share memory
int idx=threadIdx.x;//thread index
int *output; //output array in global
for( int i=0, i<12 , i++)
{
if(data[1024*i+idx]==1)
{
output=1024i+idx; //Line 1
output++;//Line 2
}
__syncthreads();
}
Problem in Line 1 & Line 2: access confliction may happen if more than two threads find 1 in one loop, and the ++ operator is dangerous since it may affect the value of pointer “output” for other threads’ operation
Question:
If applied atomic exchange to replace Line 1 & 2 like
atomicExch(output++,1024*i+idx);
1.Can it guarantee that all these two operations, including exchange and ++, are performed in one atomic transaction? If not, how to revise this code?
2.Is there better way to write this code?
Note: random order in output sequence is accepted.
Thank you for you help! External Image