How can I make sure atomicAdd() was successful?


I want to use atomicAdd() in my kernel, but how do I know if it was successful? Is it possible for it to be unsuccessful at all, or does it keep retrying until it succeeds? (It seems it fails sometimes, or I’m doing something wrong.)

The documentation mentions that it returns the old value in memory (I can’t understand why, though). So I can’t find that out directly from the return value. If I check the value in memory after the atomicAdd() call, then I can’t be sure if the new value is because this thread’s call to atomicAdd() was successful or some other threads has written a new value to the memory location.

I’m using the following function to manipulate the blockIdx in my kernel (i.e., use the value returned by this function instead of blockIdx) to model a work queue. But I’m observing duplicate results in the output, which means that the scenario I explained is occurring (if my understanding is correct).

__device__ void get_next_block(int *executed)
	int b_old, b_new;
	do {
		b_old = atomicAdd(executed, 1);
		b_new = *executed;
	} while(b_new == b_old);

I also tried doing the addition outside the while loop (only once) and it didn’t fix the problem.

Any help on how I should do this is appreciated. I’m using a K40c card and CUDA driver 8.0.

atomicAdd will always be successful, assuming you have not violated basic programming principles (e.g. assuming your executed pointer points to a valid memory location, for example).

In your example, atomicAdd will always add 1 to the location pointed to by executed, every time it is called.

Based on your description, I’m reasonably sure that it should be sufficient to just do:

__device__ int get next_block(int *executed)
  return atomicAdd(executed, 1);

That will give each user a unique return value, again assuming basic programming principles, such that executed points to a valid global memory location, and it has been properly initialized, etc.

My guess is that you are getting confused by the behavior of e.g. atomicCAS, which does have a conditional nature associated with it. atomicAdd has no such conditional behavior. It always adds to the location pointed to.

So what else could I possibly be doing wrong? I allocate memory using cudaMalloc and keep the address in the pointer that I pass to the function. I then write value -1 to the address using cudaMemcpy.

Basically what I’m trying to do is to launch a smaller grid than the total number of blocks that need to be executed (say the original grid is 1024 blocks, but I want to launch 128 blocks), and then have a loop in my kernel that executes until the value pointed to by executed pointer exceeds the total number of blocks (1024 in the example). Thread 0 at each block calls get_next_block and then I call __syncthreads() so that all the threads can see the new block ID. When I do this for binomialOptions in CUDA samples, I fail the test and when I print the generated numbers, some are repeated multiple times. I thought unsuccessful atomicAdd() could explain what is happening, but if it always secceeds then I don’t know what I’m doing wrong.

I would write 0 to the location, not -1. The return of the “old” value means that the first requester of the function will get the initialized value.

However that doesn’t seem likely as an explanation for your observation(s).

Your general description seems reasonable to me. I don’t think I would be able to spot the problem without a complete test case. If you want to provide one, it should be as short as you can make while still demonstrating the problem. It’s quite possible that in the creation of the short test case, you may discover the problem yourself.

For me to help, your test case would need to be something I can copy, paste, compile, and run, and see the problem without having to add anything or change anything.

Thanks for your help, I solved the problem. The issue was that instead of using the return value of the atomicAdd(), I was reading the result from memory again. Because all the additions had been done, all the blocks after each iteration of the loop read the same value, but when I use the return value and write it in a location in shared memory for all the threads in a block to read, it passes the test.