Update a global variable in a child kernel

Hello all,
I wonder if anyone can help me out with this.
I want to implement a parent kernel that calculates one metric from one time series (around 2000 time points), and a child kernel that replaces the for loop (0-1998; each iteration has a sub-loop from i+1 to 2000). Each iteration increments a count of occurrences if a certain condition is met. My pseudocode is:

__device__ numcount(float *dat, int loc, int rows, int len, int *count)
if(threadIdx.x>rows-1) return;
int start=threadIdx.x;
// take the threadIdx.x -th row from the
int j;
for(j=start+1; j<len; j++)
__global__ kernel(float *data, int loc, int rows, int len){

numcount<<<numbks, numthreads>>>(data, loc, rows, len, *count);

My question is: how do I avoid conflicts when many threads increment the same global variable count? In OpenMP this can be done with a reduction, but how can a similar process be implemented in CUDA?


It’s recommended to format code using the </> in the edit toolbar.

You could use an atomic update:

atomicAdd(count, 1);
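As a rough sketch of how the atomic update could fit the pair-counting loop from the question: each thread handles one start index, tallies its matches in a local variable, and issues a single atomicAdd at the end. The kernel name count_pairs, the host helper run_count_pairs, and the matching condition (absolute difference below tol) are assumptions made for illustration; the original condition is not shown in the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical condition: count pairs (i, j), i < j, with |dat[i] - dat[j]| < tol.
__global__ void count_pairs(const float *dat, int len, float tol, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= len - 1) return;            // the last element starts no pair

    int local = 0;                       // per-thread tally, no contention
    for (int j = i + 1; j < len; j++)
        if (fabsf(dat[i] - dat[j]) < tol) local++;

    if (local > 0) atomicAdd(count, local);  // one atomic update per thread
}

// Host helper (hypothetical): copies data over, runs the kernel, returns the count.
int run_count_pairs(const float *h_dat, int len, float tol)
{
    float *d_dat = nullptr;
    int *d_count = nullptr;
    int h_count = 0;
    cudaMalloc(&d_dat, len * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_dat, h_dat, len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int)); // the counter must start at zero

    int threads = 256;
    int blocks = (len + threads - 1) / threads;
    count_pairs<<<blocks, threads>>>(d_dat, len, tol, d_count);

    // Blocking copy on the default stream: waits for the kernel to finish.
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_dat);
    cudaFree(d_count);
    return h_count;
}

int main()
{
    const int len = 2000;
    static float h_dat[len];
    for (int i = 0; i < len; i++) h_dat[i] = (float)(i % 10);
    printf("count = %d\n", run_count_pairs(h_dat, len, 0.5f));
    return 0;
}
```

Accumulating locally and adding once per thread keeps the number of atomic operations at most one per start index, rather than one per matching pair.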

Also note that if you intend to call numcount as a kernel:

numcount<<<numbks, numthreads>>>(data, loc, rows, len, *count);

You must define it with the __global__ decorator (and a void return type):

__global__ void numcount(...

Here is my simplified child kernel. fsum is a float array allocated using cudaMalloc, vm is an integer array allocated with cudaMalloc too.

__global__ void vecmatch_row(float *fsum, int *vm)
    int idx= threadIdx.x;
    if ...	 
     atomicAdd( &(fsum + idx) , 1.5);  // just for testing
     atomicAdd( &(vm+idx), 1);
  }  // end of the j loop

The code ran, but nothing was added to the memory. If I changed

atomicAdd( &(fsum + idx) , 1.5);

to be

atomicAdd( (fsum + idx) , 1.5);

I got a CUDA error: invalid configuration argument.
Any further advice?

This is not correct:

atomicAdd( &(fsum + idx) , 1.5);

This is correct:

atomicAdd( (fsum + idx) , 1.5f);

The invalid configuration argument error is coming from some other aspect of your code.

atomicAdd takes a pointer as its first argument.

This is a pointer (to float):

fsum

this is also a pointer (to float):

fsum + idx

this tries to take the address of that pointer, which would give a pointer-to-pointer (i.e. a pointer to a float pointer):

&(fsum + idx)

It is not correct. You can discover this by studying the documentation for atomics: all of the atomic functions work on ordinary pointers.

This is also consistent with the example I gave you. In your original code, count is a pointer.

atomicAdd(count, 1);
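To make the pointer forms concrete, here is a minimal sketch that updates one element per thread. The kernel name add_per_element and the helper run_add_per_element are hypothetical; fsum and vm follow the names used in the thread. Both fsum + idx and &fsum[idx] denote the same float pointer, so each element is incremented twice.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_per_element(float *fsum, int *vm, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    atomicAdd(fsum + idx, 1.5f);   // pointer arithmetic: type is float*
    atomicAdd(&fsum[idx], 1.5f);   // address of an element: same float*
    atomicAdd(&vm[idx], 1);        // same idea for an int array: int*
}

// Host helper (hypothetical): returns how many elements hold unexpected values.
int run_add_per_element(int n)
{
    float *fsum = nullptr;
    int *vm = nullptr;
    cudaMalloc(&fsum, n * sizeof(float));
    cudaMalloc(&vm, n * sizeof(int));
    cudaMemset(fsum, 0, n * sizeof(float));  // all-zero bytes are 0.0f
    cudaMemset(vm, 0, n * sizeof(int));

    add_per_element<<<(n + 255) / 256, 256>>>(fsum, vm, n);

    std::vector<float> h_fsum(n);
    std::vector<int> h_vm(n);
    // Blocking copies on the default stream: they wait for the kernel.
    cudaMemcpy(h_fsum.data(), fsum, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_vm.data(), vm, n * sizeof(int), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h_fsum[i] != 3.0f || h_vm[i] != 1) bad++;

    cudaFree(fsum);
    cudaFree(vm);
    return bad;
}

int main()
{
    printf("mismatches = %d\n", run_add_per_element(1000));
    return 0;
}
```

Note the 1.5f literal: a plain 1.5 is a double, and the float overload of atomicAdd expects a float value.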

I still couldn’t get the number updated. fsum is defined inside the main function and allocated using cudaMalloc; do I need to define it as a global variable outside of the main function?
Could I do it this way?

__device__ float *fsum;
void main()
cudaMalloc(&fsum, size);

No, that won’t work. cudaMalloc cannot allocate through a pointer that is located in device memory; it needs the address of a pointer variable that lives in host memory.
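If a __device__ pointer is really wanted, one possible approach (not something proposed in this thread) is to allocate through an ordinary host-side pointer and then copy that pointer's value into the device symbol with cudaMemcpyToSymbol. The names d_fsum, bump, and run_symbol_demo below are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ float *d_fsum;   // pointer variable that lives in device memory

__global__ void bump(int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) atomicAdd(d_fsum + idx, 1.5f);  // uses the device-side pointer
}

// Host helper (hypothetical): returns how many elements differ from 1.5f.
int run_symbol_demo(int n)
{
    float *buf = nullptr;                    // host variable holding a device address
    cudaMalloc(&buf, n * sizeof(float));     // cudaMalloc writes into host memory
    cudaMemset(buf, 0, n * sizeof(float));

    // Copy the pointer value itself (not the data) into the __device__ symbol.
    cudaMemcpyToSymbol(d_fsum, &buf, sizeof(buf));

    bump<<<(n + 255) / 256, 256>>>(n);

    std::vector<float> h(n);
    cudaMemcpy(h.data(), buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h[i] != 1.5f) bad++;

    cudaFree(buf);
    return bad;
}

int main()
{
    printf("mismatches = %d\n", run_symbol_demo(256));
    return 0;
}
```

Passing the cudaMalloc'ed pointer as a kernel argument, as in the earlier posts, is usually simpler; the symbol copy only matters when the pointer must be reachable as a global from device code.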

I think you probably have other errors in your code, and you are fixated on the idea that this is not working, when in fact something else in your code is not working.

I encourage you to use proper CUDA error checking, run your code with cuda-memcheck, and use a methodical debug approach that tackles the problems the tools are telling you, rather than jumping to the conclusion that somehow the atomic update is not working.
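As an illustration of the kind of error checking meant here, the sketch below checks the launch itself with cudaGetLastError and then the execution with cudaDeviceSynchronize. The kernel noop and the helper try_launch are hypothetical. Launching with 2048 threads per block exceeds the 1024-thread limit of current GPUs and is one common way to trigger the "invalid configuration argument" error mentioned above; whether that was the cause in this thread is not known.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t try_launch(int blocks, int threads)
{
    noop<<<blocks, threads>>>();

    // Errors in the launch configuration show up here, right after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Errors raised while the kernel runs show up when the host synchronizes.
    return cudaDeviceSynchronize();
}

int main()
{
    cudaError_t ok  = try_launch(1, 256);    // valid configuration
    cudaError_t bad = try_launch(1, 2048);   // too many threads per block

    printf("256 threads:  %s\n", cudaGetErrorString(ok));
    printf("2048 threads: %s\n", cudaGetErrorString(bad));
    return 0;
}
```

In recent CUDA toolkits, compute-sanitizer takes the place of cuda-memcheck for the runtime checking step.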

The problem was that I didn’t call cudaDeviceSynchronize. I thought it was the same as __syncthreads().
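For readers who hit the same confusion, here is a small sketch of the difference, under the assumption that results are read on the host through managed memory. __syncthreads() is a barrier for the threads of one block inside a kernel; cudaDeviceSynchronize() makes the host wait until previously launched work has finished. The names rotate and run_sync_demo are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256

__global__ void rotate(int *out)
{
    __shared__ int tile[N];
    int t = threadIdx.x;

    tile[t] = t;
    __syncthreads();              // barrier for the threads of THIS block only:
                                  // ensures every tile[] entry is written
    out[t] = tile[(t + 1) % N];   // safe to read a neighbour's entry now
}

// Host helper (hypothetical): returns how many elements hold unexpected values.
int run_sync_demo()
{
    int *out = nullptr;
    cudaMallocManaged(&out, N * sizeof(int));  // visible to host and device

    rotate<<<1, N>>>(out);        // kernel launches are asynchronous

    // Host-side wait: without it, the host could read 'out' before
    // the kernel has finished writing it.
    cudaDeviceSynchronize();

    int bad = 0;
    for (int i = 0; i < N; i++)
        if (out[i] != (i + 1) % N) bad++;

    cudaFree(out);
    return bad;
}

int main()
{
    printf("mismatches = %d\n", run_sync_demo());
    return 0;
}
```

A blocking cudaMemcpy on the default stream also waits for earlier kernels, which is why code that copies results back explicitly often works without an explicit cudaDeviceSynchronize call.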