Construct 64-bit atomic operations from 32-bit ones? Is that possible for Min and Max?

For example, a 64-bit atomicAdd can be constructed, under certain access restrictions (only concurrent atomicAdd's are allowed on the value), as follows:

// treat the 64-bit value as separate high and low 32-bit words
unsigned int old_low = atomicAdd(addr_ui4_low, val_ui4_low);
if (old_low + val_ui4_low < val_ui4_low) // the 32-bit add wrapped, so there WAS an overflow
	atomicAdd(addr_ui4_high, 1); // propagate the carry into the high word
atomicAdd(addr_ui4_high, val_ui4_high);

Actually, since SM 1.2 there is a native 64-bit version of atomicAdd, but it does not work on shared memory, so the code above is sometimes useful.
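For what it's worth, here is a minimal self-contained sketch of the same trick wrapped in a device function. The names (atomicAdd64_split, lo, hi) are made up for illustration, and it still assumes the only concurrent accesses to the value are additions made through this function:

// Sketch only: the 64-bit counter is stored as two unsigned ints, low word first,
// and every concurrent access to it must go through this function.
__device__ void atomicAdd64_split(unsigned int *lo, unsigned int *hi, unsigned long long val)
{
	unsigned int val_lo = (unsigned int)(val & 0xFFFFFFFFULL);
	unsigned int val_hi = (unsigned int)(val >> 32);

	unsigned int old_lo = atomicAdd(lo, val_lo); // returns the previous low word
	if (old_lo + val_lo < val_lo)                // the low-word add wrapped around
		atomicAdd(hi, 1u);                       // propagate the carry
	atomicAdd(hi, val_hi);                       // add the high word
}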

Does anyone know a similar solution for emulating 64-bit atomicMin and atomicMax?

Any ideas?

Use a spare memory address as a mutex. This lets you perform any operation atomically, even complex math. It's not necessarily super efficient, but it works.

Something like…

// mutex is a pointer to a word that has been initialized to 1
while (0 == atomicExch(mutex, 0)); // keep attempting to grab the mutex until we own it
// do whatever atomic operation(s) you like... you own the mutex, so you won't be stepped on by anyone else
*z = max(*z, 12345LL); // 64-bit max
atomicExch(mutex, 1); // give up the mutex

This mutex method works, but it’s inefficient. I use the idea for dynamic memory allocation on the device inside my raytracing kernel.

It can be sped up (with significant coding effort) by realizing that once a thread holds the mutex, its entire warp can coordinate with itself and do its operations together, as long as those operations only need to be consistent among the sibling threads. This often lets you use one mutex acquisition instead of 32 of them.

I don't think this will work. Threads of a single warp will get stuck on the line

while (0==atomicExch(mutex, 0));

because the first of them will acquire the mutex and leave the loop (taking a divergent path), and then it will wait until the others leave their endless loop (but they cannot acquire the mutex, so they will spin there forever).

This code can only work when at most one thread per warp enters this "atomic" block.

I agree there could be a solution that synchronizes whole warps; however, it would be equivalent to keeping an aggregate (min, max, or whatever) value for every thread in shared memory and then reducing the array, roughly like the sketch below. It's a good approach, but it eats a lot of shared memory, which is valuable enough that I'd rather keep it for other things and use atomic operations instead.
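For concreteness, here is a minimal sketch of that shared-memory alternative, not taken from anyone's actual code: the names (blockMax64, per_block_max, BLOCK) are made up, it assumes a power-of-two block size and a fully populated input, and it shows exactly the shared-memory cost being discussed (8 bytes per thread):

#define BLOCK 256 // assumed block size (power of two)

__global__ void blockMax64(const long long *in, long long *per_block_max)
{
	__shared__ long long s_val[BLOCK]; // one 64-bit slot per thread: 2 KB at 256 threads

	int tid = threadIdx.x;
	s_val[tid] = in[blockIdx.x * BLOCK + tid]; // each thread loads its value
	__syncthreads();

	// standard tree reduction in shared memory
	for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
		if (tid < stride && s_val[tid + stride] > s_val[tid])
			s_val[tid] = s_val[tid + stride];
		__syncthreads();
	}

	if (tid == 0)                             // one thread publishes the block's max;
		per_block_max[blockIdx.x] = s_val[0]; // combining blocks still needs a mutex or a 64-bit atomic
}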

Actually you’re right, my own code doesn’t do it exactly this way for that reason… I got lazy when typing an answer to your question.

Something like:

int waiting = 1; // this thread still wants its atomic access

while (waiting) {

	if (atomicExch(mutex, 0)) { // we succeeded in grabbing the mutex

		// do whatever mutex-locked instructions you like here

		waiting = 0; // done, stop looping

		atomicExch(mutex, 1); // release the mutex

	}

}
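Put together for the original question, a sketch of a mutex-based 64-bit atomicMax might look like the following. The names (atomicMax64, mutex, addr) are just illustrative; the mutex word must be initialized to 1 beforehand, every access to *addr must go through functions that take the same mutex, and only one thread at a time gets through, as discussed above:

// Sketch only: emulate a 64-bit atomicMax with the spin-lock pattern above.
// mutex points to an int initialized to 1; addr points to the 64-bit value.
__device__ void atomicMax64(int *mutex, volatile long long *addr, long long val)
{
	int waiting = 1;
	while (waiting) {
		if (atomicExch(mutex, 0)) { // we grabbed the mutex
			if (val > *addr)        // safe read-modify-write while we hold the lock
				*addr = val;
			__threadfence();        // make the store visible before the release
			waiting = 0;
			atomicExch(mutex, 1);   // release the mutex
		}
	}
}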

That's great, thank you. This code seems to be an alternative to a more naive one:

for (int i = 0; i < blockDim.x; i++)
	if (threadIdx.x == i)
	{
		// some code serialized across threads of the same block
	}

…but it automatically skips inactive threads and is suitable for multi-block synchronization (if the mutex is in global memory).

Thanks again.