Custom atomicMax for int2 type

CudaaduC · January 20, 2014, 2:42am

Trying to implement a custom atomicMax() for the int2 type, where:

atomicMax(&b[0],a);

if(a.x>b[0].x) it updates the int2 b[0] to the x and y values of int2 a
if(a.x==b[0].x && a.y < b[0].y) then it sets b.y to a.y

for the device function I am not sure about the casting.

I have another way of accomplishing the same thing, but was wondering if anyone had an idea of how to correctly implement.

Arakageeta · January 22, 2014, 1:41am

Atomic operations are meant to be executed by a single hardware instruction. You can only fake an atomic instruction by way of a mutex (i.e., spinlock). You might be able to do what you want with atomicCAS(), if only it returned a success/error code like GCC’s __sync_bool_compare_and_swap(). Sadly, it does not. I think your only option may be to implement a spinlock with atomicCAS() and perform the max operation within the critical section. I believe you’d want to make the assignments within the critical section atomic operations as well (or maybe mark the variables as volatile?) if you need to share the value between CUDA blocks (the L1 caches are incoherent, so you have to play nasty tricks such as this).

CudaaduC · January 22, 2014, 2:45am

Thank you for your informative post.

I will probably go a different route, but it is an interesting problem and I would imagine I am not the first to want such functionality in atomic operations.

pasoleatis · January 22, 2014, 10:52am

atomic functions are used for the operations with race conditions. The imply that the address in memory which is read is not read by another thread until the operation is finished. This is done with atomicCAS. Check in the programming guide there is a prototype of an atomicAdd for double precision elements Programming Guide :: CUDA Toolkit Documentation

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

Arakageeta · January 24, 2014, 5:39pm

Following pasoleatis’s advice, I believe you get:

__device__ int2 atomicMax(int2* address, int2 val)
{
    unsigned long long* addr_as_ull = (unsigned long long*)address;
    unsigned long long  old = *addr_as_ull;
    unsigned long long  assumed;
    do {
        assumed = old;
        int2* temp = (int2*)&assumed;
        if (val.x > temp->x || (val.x == temp->x && val.y > temp->y))
            old = atomicCAS(addr_as_ull, assumed, *(unsigned long long*)&val);
        else
            break;
    }while(assumed != old);
    return *((int2*)&old);
}

This only works because sizeof(int2) == sizeof(unsigned long long). You would need to take a spinlock-approach for types with sizes > sizeof(unsigned long long), such as int4.

Note: The code above reverses the y-component comparison specified by the OP in order to use a standard/generic tuple comparison. (The OP basically requests an atomicMin on the y-component if the x-components are equal.)

pasoleatis · January 24, 2014, 6:47pm

I would be cool, if atomic operations could be hard coded in the gpu for arbitrary types.

Arakageeta · January 25, 2014, 1:07am

You need to get into transactional memory to support “atomics” on data larger than a word size (8 bytes on x86_64 and CUDA). Transactional memory is a relatively new technology and it is just starting to appear on Intel CPUs. It’s hard to imagine a use-case for transactional memory on the GPU that would be useful to computer graphics.

CudaaduC · January 25, 2014, 11:24pm

Arakageeta:

Following pasoleatis’s advice, I believe you get:
__device__ int2 atomicMax(int2* address, int2 val)
{
    unsigned long long* addr_as_ull = (unsigned long long*)address;
    unsigned long long  old = *addr_as_ull;
    unsigned long long  assumed;
    do {
        assumed = old;
        int2* temp = (int2*)&assumed;
        if (val.x > temp->x || (val.x == temp->x && val.y > temp->y))
            old = atomicCAS(addr_as_ull, assumed, *(unsigned long long*)&val);
        else
            break;
    }while(assumed != old);
    return *((int2*)&old);
}
This only works because sizeof(int2) == sizeof(unsigned long long). You would need to take a spinlock-approach for types with sizes > sizeof(unsigned long long), such as int4.

Note: The code above reverses the y-component comparison specified by the OP in order to use a standard/generic tuple comparison. (The OP basically requests an atomicMin on the y-component if the x-components are equal.)

Bingo, that works great! I did modify it slightly to fit my situation, but that prototype really helps when I have a situation where I am looking for a max value, and need to evaluate/cache another variable associated with that value.

Even if the a way for creating a custom atomic max was limited to 8 bytes(for an arbitrary type), it still would be very useful.

Thanks!

Edit: In a few cases not getting the right val.y answer, trying to figure out why. Will report when I know more…

CudaaduC · January 26, 2014, 1:52am

I realized there was another issue in the code which had some serial aspect so ran a basic scan test with the int2 type.
Based on those results this int2 atomicMax does indeed seem to work.

CudaaduC · January 26, 2014, 5:21am

Ok, just to make sure I am understand this correctly, lets say I want to search some problem space for an optimal answer(in this case a maximum value), and want to cache both that answer AND some other 32-bit value which was associated with that answer, would this be the correct way to implement as an atomicMax ?

__device__ int2 atomicMax_With_ID(int2* address, int2 val){
    unsigned long long* addr_as_ull = (unsigned long long*)address;
    unsigned long long  old = *addr_as_ull;
    unsigned long long  assumed;
    do{
        assumed = old;
        int2* temp = (int2*)&assumed;
        if(val.x > temp->x)
            old = atomicCAS(addr_as_ull, assumed, *(unsigned long long*)&val);
        else
            break;
    }while(assumed != old);
    return *((int2*)&old);
}

In this case the .x value is the maximum which is being compared, and the .y value is the ID number. Since the comparison only involves the .x value, and it is an aligned 8 byte word, the other 4 bytes represent the ID value associated with that .x value.

I made some small tests and it seemed to work correctly, but maybe I am missing something. Usually when I need an optimum value and the ID, I do not use atomics, but it would be nice to have that option.

This version (without using a global cache array to store block-best values) is only 5-15% slower than the standard shared reduction using __shfl() and writing once per block to global cache. Rather in this test case I use a single int2 pointer which is updated by all blocks.

I still reduce using __shfl() across the thread block and still use a small amount of shared memory, but only use that single global int2 pointer for this atomicMax update when the whole block has been reduced down and threadIdx.x==0.

Any other negatives to this approach?