update a global variable in child kernel

Hello All,
I wonder if anyone can help me out with this.
I want to implement a parent kernel that calculates one metric from one time series (around 2000 time points), and a child kernel that replaces the for loop (0-1998, where each iteration i has a sub-loop from i+1 to 2000). Each iteration increments a count of occurrences if a certain condition is met. My pseudocode is:

__device__ void numcount(float *dat, int loc, int rows, int len, int *count)
{
    if (threadIdx.x > rows - 1) return;
    int start = threadIdx.x;   // take the threadIdx.x-th row from the ...
    int j;
    for (j = start + 1; j < len; j++)
    {
        if (...)               // certain condition on the data
            (*count)++;
    }
}
__global__ void kernel(float *data, int loc, int rows, int len, int *count)
{
    numcount<<<numbks, numthreads>>>(data, loc, rows, len, count);
}

My question is: how do I avoid conflicts when multiple threads increment the same global variable count? In OpenMP this can be done with a reduction, but how do I implement a similar process in CUDA?

Thanks
Ze

It’s recommended to format code using the </> in the edit toolbar.

You could use an atomic update:

atomicAdd(count, 1);

Also note that if you intend to call numcount as a kernel:

numcount<<<numbks, numthreads>>>(data, loc, rows, len, count);

you must define it with the __global__ decorator:

__global__ void numcount(...
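
Putting both points together, here is a minimal sketch of what the child kernel could look like (the comparison, the launch configuration, and the use of rows as the thread count are placeholders; a device-side launch also requires building with relocatable device code, -rdc=true, for dynamic parallelism):

__global__ void numcount(const float *dat, int loc, int rows, int len, int *count)
{
    if (threadIdx.x > rows - 1) return;
    int start = threadIdx.x;
    for (int j = start + 1; j < len; j++)
    {
        if (dat[start] == dat[j])      // placeholder for your actual condition
            atomicAdd(count, 1);       // conflict-free increment of the shared counter
    }
}

__global__ void kernel(float *data, int loc, int rows, int len, int *count)
{
    // device-side (child) launch: requires dynamic parallelism (-rdc=true)
    numcount<<<1, rows>>>(data, loc, rows, len, count);
}

With this, every thread that finds a match performs a safe increment, and the final value in count can be copied back to the host afterwards.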

Here is my simplified child kernel. fsum is a float array allocated using cudaMalloc, and vm is an integer array also allocated with cudaMalloc.

__global__ void vecmatch_row(float *fsum, int *vm)
{
    int idx = threadIdx.x;
    // ... (outer j loop elided)
    if (...)
    {
        atomicAdd( &(fsum + idx) , 1.5);  // just for testing
        atomicAdd( &(vm + idx) , 1);
    }
    // end of the j loop
}

The code can run, but nothing is added to the memory space. If I change

atomicAdd( &(fsum + idx) , 1.5);

to be

atomicAdd( (fsum + idx) , 1.5);

I get a CUDA error: invalid configuration argument.
Any further advice?
Thanks

This is not correct:

atomicAdd( &(fsum + idx) , 1.5);

This is correct:

atomicAdd( (fsum + idx) , 1.5f);

The invalid configuration argument error is coming from some other aspect of your code.

atomicAdd takes a pointer variable as its first argument

This is a pointer (to float):

fsum

this is also a pointer (to float):

fsum + idx

this is a pointer-to-pointer (i.e. a pointer to a float pointer):

&(fsum + idx)

It is not correct. You can discover this by studying the documentation for atomics:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

all of the atomic functions work on ordinary pointers.

This is also consistent with the example I gave you. In your original code, count is a pointer.

atomicAdd(count, 1);
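
If it helps, here is a minimal, self-contained sketch (not your code; the size, launch configuration, and values are made up) showing the whole round trip with ordinary device pointers:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecmatch_row(float *fsum, int *vm, int n)
{
    int idx = threadIdx.x;
    if (idx < n)
    {
        atomicAdd(fsum + idx, 1.5f);   // first argument is a pointer to float
        atomicAdd(vm + idx, 1);        // first argument is a pointer to int
    }
}

int main()
{
    const int n = 256;
    float *fsum;
    int   *vm;
    cudaMalloc(&fsum, n * sizeof(float));
    cudaMalloc(&vm,   n * sizeof(int));
    cudaMemset(fsum, 0, n * sizeof(float));
    cudaMemset(vm,   0, n * sizeof(int));

    vecmatch_row<<<1, n>>>(fsum, vm, n);

    float h_fsum[n];
    int   h_vm[n];
    cudaMemcpy(h_fsum, fsum, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_vm,   vm,   n * sizeof(int),   cudaMemcpyDeviceToHost);
    printf("fsum[0] = %f, vm[0] = %d\n", h_fsum[0], h_vm[0]);

    cudaFree(fsum);
    cudaFree(vm);
    return 0;
}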

I still couldn’t get the number updated. fsum is defined inside the main function and allocated using cudaMalloc; do I need to define it as a global variable outside of the main function?
Could I do it this way?

__device__ float *fsum;
int main()
{
    ...
    cudaMalloc(&fsum, size);
}

No, that won’t work. cudaMalloc cannot write the allocated address into a pointer that is itself located in device memory (such as a __device__ variable).
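
The usual pattern is to keep fsum as an ordinary pointer declared in host code (as you already had it) and pass the device address to the kernel as an argument. A rough sketch (the kernel name and size are made up):

#include <cuda_runtime.h>

__global__ void mykernel(float *fsum)
{
    fsum[threadIdx.x] = 0.0f;               // use the pointer like any other kernel argument
}

int main()
{
    float *fsum = nullptr;                   // ordinary host-side pointer variable
    cudaMalloc(&fsum, 256 * sizeof(float));  // cudaMalloc stores the device address into it
    mykernel<<<1, 256>>>(fsum);              // the device address is passed to the kernel by value
    cudaDeviceSynchronize();
    cudaFree(fsum);
    return 0;
}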

I think you probably have other errors in your code, and you are fixated on the idea that the atomic update is not working, when in fact something else in your code is not working.

I encourage you to use proper CUDA error checking, run your code with cuda-memcheck, and use a methodical debugging approach that tackles the problems the tools report, rather than jumping to the conclusion that somehow the atomic update is not working.
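
For example, a simple error-checking macro (just one possible pattern, not something provided by the toolkit) might look like this:

#include <cstdio>
#include <cuda_runtime.h>

// report any error returned by a CUDA runtime call, with file and line
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
        }                                                                   \
    } while (0)

Wrap every runtime API call in it; after a kernel launch, CUDA_CHECK(cudaGetLastError()) reports launch and configuration errors (such as invalid configuration argument), and CUDA_CHECK(cudaDeviceSynchronize()) reports errors that occur while the kernel runs.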

The problem was that I didn’t call cudaDeviceSynchronize. I thought it was the same as __syncthreads().
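
For anyone who finds this thread later: the two are different. __syncthreads() is a barrier among the threads of one block inside a kernel, while cudaDeviceSynchronize() makes the host (or, in older CUDA versions with dynamic parallelism, a parent kernel) wait until previously launched device work has finished. A generic illustration, not the original code:

#include <cuda_runtime.h>

__global__ void k(float *out)
{
    __shared__ float tmp[256];
    tmp[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();                       // barrier: every thread of this block reaches this point
    out[threadIdx.x] = tmp[255 - threadIdx.x];
}

int main()
{
    float *out;
    cudaMalloc(&out, 256 * sizeof(float));
    k<<<1, 256>>>(out);                    // the launch is asynchronous with respect to the host
    cudaDeviceSynchronize();               // host waits here until all launched work has completed
    cudaFree(out);
    return 0;
}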