Hello All,

I wonder if anyone can help me out with this.

I want to implement a parent kernel that calculates one metric from one time series (around 2000 time points), and a child kernel that replaces the for loop (i from 0 to 1998, where each iteration has a sub-loop over j from i+1 to 2000). Each iteration increments a counter if a certain condition is met. My pseudocode is:

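    // Child kernel. It is launched with <<<...>>>, so it has to be __global__, not __device__.
    __global__ void numcount(float *dat, int loc, int rows, int len, int *count)
    {
        if (threadIdx.x > rows - 1) return;

        int start = threadIdx.x;
        // take the threadIdx.x -th row from the
        for (int j = start + 1; j < len; j++)
        {
            if…
                (*count)++;   // every thread increments the same counter without synchronization
        }
    }

    // Parent kernel. count points to a counter in global memory.
    __global__ void kernel(float *data, int loc, int rows, int len, int *count)
    {
        numcount<<<numbks, numthreads>>>(data, loc, rows, len, count);
    }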

My question is: how do I avoid conflicts when many threads increment the same global variable count? In OpenMP this can be done with a reduction clause, but how can a similar process be implemented in CUDA?
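To make the question concrete, here is a minimal sketch of the closest thing to an OpenMP-style reduction that I am aware of: each thread counts into a private variable, the block sums those counts in shared memory, and only one atomicAdd per block touches the global counter. Several things here are my own assumptions and not part of my real code: the condition `pair_matches` (two values closer than a tolerance `tol`) is a hypothetical stand-in for the real test, the names `count_pairs` and `count_pairs_gpu` are made up, and for simplicity the kernel is launched from the host, not from a parent kernel. I believe the same kernel could also be launched as a child kernel as long as the counter lives in global memory, but I have not verified that.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the real condition: two values closer than tol.
__host__ __device__ inline bool pair_matches(float a, float b, float tol)
{
    return fabsf(a - b) < tol;
}

// One thread per outer index i. Each thread counts privately, the block
// reduces the private counts in shared memory, and thread 0 issues a single
// atomicAdd per block on the global counter.
__global__ void count_pairs(const float *dat, int len, float tol, int *count)
{
    extern __shared__ int partial[];   // blockDim.x ints, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    int local = 0;                     // private counter, no conflicts
    if (i < len - 1) {
        for (int j = i + 1; j < len; j++) {
            if (pair_matches(dat[i], dat[j], tol)) local++;
        }
    }

    partial[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) atomicAdd(count, partial[0]);
}

// Host wrapper: copies the series to the device, runs the kernel,
// and returns the total count. Error checking is omitted for brevity.
int count_pairs_gpu(const std::vector<float> &series, float tol)
{
    int len = static_cast<int>(series.size());
    if (len < 2) return 0;             // no pairs to compare

    float *d_dat = nullptr;
    int *d_count = nullptr;
    cudaMalloc(&d_dat, len * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_dat, series.data(), len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    const int threads = 256;           // power of two, required by the reduction
    int blocks = (len + threads - 1) / threads;
    count_pairs<<<blocks, threads, threads * sizeof(int)>>>(d_dat, len, tol, d_count);

    int result = 0;
    cudaMemcpy(&result, d_count, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_dat);
    cudaFree(d_count);
    return result;
}
```

The simplest alternative would be to replace `(*count)++` with `atomicAdd(count, 1)` directly inside the inner loop. That should also be correct, but every increment would then contend for the same address, whereas the version above issues one atomic operation per block.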

Thanks

Ze