For example, a 64-bit atomicAdd can be constructed, with certain access restrictions (only concurrent atomicAdds), as follows:
// treating the 64-bit operand as high and low 32-bit words
if (atomicAdd(addr_ui4_low, val_ui4_low) + val_ui4_low < val_ui4_low) // check if the low word wrapped around
    atomicAdd(addr_ui4_high, 1); // propagate the carry
atomicAdd(addr_ui4_high, val_ui4_high);
Actually, from compute capability 1.2 (sm_12) onward there is a native 64-bit version of atomicAdd, but it does not work on shared memory, so the code above is sometimes useful.
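The carry check relies on unsigned wraparound: the low-word addition overflowed exactly when old_low + val_low comes out smaller than val_low. A host-side sketch of the same arithmetic in plain C (single-threaded, names are mine, just to show the carry logic):

```c
#include <stdint.h>

/* Emulated 64-bit add over two 32-bit words, mirroring the trick above. */
static void add64_emulated(uint32_t *low, uint32_t *high,
                           uint32_t val_low, uint32_t val_high)
{
    uint32_t old_low = *low;            /* stands in for atomicAdd's return value */
    *low = old_low + val_low;
    if (old_low + val_low < val_low)    /* unsigned wraparound => carry out of the low word */
        *high += 1;                     /* propagate the carry */
    *high += val_high;
}
```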
Does anyone know of a similar solution for 64-bit emulation of atomicMin and atomicMax?
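(For comparison: where a 64-bit atomicCAS is available, the usual idiom is a compare-and-swap retry loop. A plain-C, single-threaded sketch of the loop shape, with a hypothetical cas64 standing in for atomicCAS:)

```c
#include <stdint.h>

/* Stand-in for atomicCAS(addr, expected, desired): single-threaded here,
   only meant to show the shape of the retry loop. */
static uint64_t cas64(uint64_t *addr, uint64_t expected, uint64_t desired)
{
    uint64_t old = *addr;
    if (old == expected)
        *addr = desired;
    return old;
}

/* 64-bit max emulated with a CAS retry loop. */
static uint64_t max64_cas(uint64_t *addr, uint64_t val)
{
    uint64_t old = *addr;
    while (old < val) {             /* nothing to do once *addr >= val */
        uint64_t assumed = old;
        old = cas64(addr, assumed, val);
        if (old == assumed)
            break;                  /* our swap landed */
    }
    return old;
}
```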
Use a spare memory address as a mutex. This lets you perform any operation atomically, even complex math. It’s not necessarily super efficient but it works.
Something like…
// mutex is a pointer to a word that has been initialized to 1 (1 = free, 0 = held)
while (0 == atomicExch(mutex, 0)); // keep attempting to get the mutex until we own it
// do whatever atomic operation(s) you like: you own the mutex, so you won't be stepped on by anyone else
*z = max(*z, 12345LL); // e.g. a 64-bit max
atomicExch(mutex, 1); // give up the mutex
This mutex method works, but it’s inefficient. I use the idea for dynamic memory allocation on the device inside my raytracing kernel.
It can be sped up, with significant coding, by realizing that if one thread holds the mutex, its entire warp can coordinate with itself and do its operations simultaneously, as long as they are consistent among the sibling threads. This often lets you use one mutex acquisition instead of 32 of them.
I don’t think this will work. The threads of a single warp will get stuck on the line
while (0 == atomicExch(mutex, 0));
because the first of them to acquire the mutex leaves the loop (onto a divergent path) and then waits for the others, who spin in the loop forever since they can never acquire the mutex.
This code can only work when at most one thread per warp enters this “atomic” block.
I agree there could be a solution that synchronizes whole warps, but it would be equivalent to keeping aggregate values (min, max, or whatever) for every thread in shared memory and reducing the array afterwards. A good approach, but it eats a lot of shared memory, which is often too valuable to spend when atomic operations would do.
Actually you’re right, my own code doesn’t do it exactly this way for that reason… I got lazy when typing an answer to your question.
Something like:
int waiting = 1; // I want atomic access
while (waiting) {
    if (atomicExch(mutex, 0)) { // we succeeded in getting the mutex
        // do whatever mutex-locked instructions you like here
        waiting = 0; // done: leave the loop once we've released
        atomicExch(mutex, 1); // release the mutex
    }
}
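The same acquire-in-branch/release-in-branch shape, in host-side C11 atomics (a single-threaded sketch just to show the control flow; on the device you would likely also want __threadfence() around the critical section so the writes become visible to other blocks):

```c
#include <stdatomic.h>

static atomic_int mutex = 1;     /* 1 = free, 0 = held, same convention as above */
static long long value64 = 0;    /* 64-bit datum guarded by the mutex */

/* Acquire and release happen in the same branch, so the winning thread
   cannot be starved at a reconvergence point while the losers spin. */
static void locked_max64(long long v)
{
    int waiting = 1;
    while (waiting) {
        if (atomic_exchange(&mutex, 0)) {   /* we got the mutex */
            if (v > value64)                /* critical section: 64-bit max */
                value64 = v;
            waiting = 0;                    /* done: leave the loop */
            atomic_exchange(&mutex, 1);     /* release */
        }
    }
}
```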