How to use atomicCAS() to implement atomicAdd(short)? Trouble adapting programming guide example

[SOLUTION]

For anyone coming across this thread anew, I thought I’d present up front the code that I am currently implementing. I am very glad to say that it successfully solves the issue posed in the thread topic. You can read about the details of the implementation in the comments below, as well as in this thread. Thanks go to tera and Sylvain Collange for their efforts and willingness to share. This has been very fruitful and your time has been much appreciated.

__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((char *)address - ((size_t)address & 2));    // tera's revised version (showtopic=201975)
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : (unsigned short)val;
    unsigned int long_old = atomicAdd(base_address, long_val);

    if ((size_t)address & 2) {
        return (short)(long_old >> 16);
    } else {
        unsigned int overflow = ((long_old & 0xffff) + long_val) & 0xffff0000;
        if (overflow)
            atomicSub(base_address, overflow);
        return (short)(long_old & 0xffff);
    }
}

This has been successfully applied to signed shorts, and subtraction is achieved by passing a negative val.

Overflows and underflows are handled gracefully, and do not spill over into the adjacent short.

Cheers,

Mike

You need to use atomicCAS() on a 32-bit integer and extract the low and high 16-bit words yourself, similar to what I did in another thread for 8-bit integers.

thanks, I missed that in the search but will have a close look at it

cheers

Hi tera,

Thanks for pointing me to your detailed discussion in the linked thread. However, I’m struggling to understand how to properly extract the lower or upper half-word (and how to tell which half it’s in), particularly how to use the hex offsets as you’ve done in your example.

The __byte_perm() explanation in the programming guide, with all the selectors and nibbles, is confusing :sad:

Below is adapted from your code, but I have no idea what values to use at the commented lines:

__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~3);
    unsigned int selectors[] = {0x3214, 0x3240, 0x3410, 0x4210};    // don't know what to use here
    unsigned int sel = selectors[(size_t)address & 3];
    unsigned int old, assumed, sum_, new_;

    old = *base_address;
    do {
        assumed = old;
        sum_ = val + (short)__byte_perm(old, 0, ((size_t)address & 3) | 0x4440);    // or here
        new_ = __byte_perm(old, sum_, sel);
        if (new_ == old)
            break;
        old = atomicCAS(base_address, assumed, new_);
    } while (assumed != old);
    return old;
}

If you could provide some further explanation I would be grateful.

Cheers,

Mike

Yeah, I’d say for short it’s not worth going through all that. Just use shifts and masks directly:

__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int old, assumed, new_;

    old = *base_address;
    do {
        assumed = old;
        new_ = ((size_t)address & 2) ? old + ((unsigned int)val << 16)
                                     : (old & 0xffff0000) | ((old & 0xffff) + val);
        old = atomicCAS(base_address, assumed, new_);
    } while (assumed != old);
    return old;
}

Or, if you have spent some time understanding __byte_perm() and now would like to see how it would work:

__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int old, assumed, sum, new_;

    old = *base_address;
    do {
        assumed = old;
        sum = val + (short)__byte_perm(old, 0, ((size_t)address & 2) ? 0x4432 : 0x4410);
        new_ = __byte_perm(old, sum, ((size_t)address & 2) ? 0x5410 : 0x3254);
        old = atomicCAS(base_address, assumed, new_);
    } while (assumed != old);
    return old;
}

tera, your instructive examples are most appreciated.

From my reading on bit manipulation, your first example does the following:

1. Calculates the base address of the 32-bit int containing our target short by dropping the lowest bit of the argument pointer; then
2. Checks whether the short we’re after is in the upper half-word; and
   - if so, shifts the short val up to the upper half-word and adds it to what’s there; or
   - if not, preserves the upper half-word but adds our short to the lower half-word, and combines the two halves using a bitwise OR; and finally
3. Updates the entire int value using atomicCAS().

Is that right?

This has greatly helped my understanding of these bitwise operations. But a couple of questions:

1. Why does the construct & ~1 get the prior 32-bit int address in this case, whereas in your char example you used & ~3?
2. And why do we even need to round down to the previous 32-bit int address? Why can’t we just take the address as given and replace the first 16 bits of the NEXT 32-bit sequence starting at address?
3. You’ve used a bitwise OR to combine the adjusted lower half-word in the second case, but simply used addition in the first. Why do you only need to left-shift with addition in the first case, rather than left-shift with a bitwise OR?

Thanks again for providing the __byte_perm() algorithm. I’ll need to study it more before I can make sense of it, but at this stage I’m closer to understanding the shift-and-mask technique (it seems cleaner to me - and quicker?)

I forgot to mention that I’ll have a play with the code you provided first thing tomorrow when I’m back at uni :smile:

Yes, that’s right.

1. You actually caught a bug here - we need only one bit to distinguish between upper and lower half-words (whereas the char example needed two bits to address the four bytes in a word individually). But that is bit No. 1, not No. 0. I’ll fix the example code…
2. The hardware only allows aligned memory accesses. As far as I know, current hardware just discards the lower bits, so we would not need to mask them off. The PTX manual defines the behavior as undefined, though (Nvidia probably wants to keep its options open to allow unaligned accesses later). So, to also work on future devices, we mask the bits off.
3. As long as no carry is generated (and here none is), bitwise OR and addition are the same. As the cost is also the same (apart from minimal differences in power consumption, maybe), I might well have used one in one example and the other in a different example.

Yes, it’s probably quicker. Note that this is entirely memory bound though, so computational efficiency does not matter.

Hi tera,

At the moment I’m getting an "expression must have integral or enum type" error on the new_ = line with the ternary operator.

I’ve tried casting some of the values to unsigned int, thinking it’s because new_ is of type unsigned int, but to no avail.
I must say I still don’t understand enough about the bit operations to be able to debug this error.

Can ‘val’ be negative? Can overflows occur?

If the answer to both questions is “no”, then wouldn’t something like the following code do the trick?

__device__ unsigned short atomicAddShort(unsigned short* address, unsigned short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : val;
    unsigned int long_old = atomicAdd(base_address, long_val);

    return ((size_t)address & 2) ? (unsigned short)(long_old >> 16) : (unsigned short)(long_old & 0xffff);
}
It assumes no carry propagation from the lower part to the higher part may occur, which is safe as long as the lower part does not overflow.

Even if that’s not the case, we can still work around it:

__device__ unsigned short atomicAddShort(unsigned short* address, unsigned short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : val;
    unsigned int long_old = atomicAdd(base_address, long_val);

    if ((size_t)address & 2) {
        return (unsigned short)(long_old >> 16);
    } else {
        unsigned short old = (unsigned short)(long_old & 0xffff);
        if ((unsigned short)(old + val) < old) {    // cast needed: otherwise the sum
                                                    // is promoted to int and never wraps
            // Oops, overflow in lower part occurred
            atomicAdd(base_address, -(1 << 16));    // fix it by subtracting back the carry
        }
        return old;
    }
}

(disclaimer: this is completely untested code)

Sorry, wasn’t my best of days… I missed the cast to size_t in that place (as bit operations on pointers are not allowed). And while at it, I also fixed the selectors for the __byte_perm() example.

Great idea, Sylvain! And it works for signed short, too:

__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : (unsigned short)val;
    unsigned int long_old = atomicAdd(base_address, long_val);

    if ((size_t)address & 2) {
        return (short)(long_old >> 16);
    } else {
        unsigned int overflow = ((long_old & 0xffff) + long_val) & 0xffff0000;
        if (overflow)
            atomicSub(base_address, overflow);
        return (short)(long_old & 0xffff);
    }
}

Sylvain and tera, thank you both so much for taking the time to continue contributing to and updating this thread.

tera has lost me somewhat with that last example, and Sylvain’s second one also baffles me at the moment. But instead of asking more silly questions I will be sure to study these more closely on my own.

I can say that my use case will only require unsigned shorts and can be assumed not to overflow in the positive direction (the maximum count the shorts will need to track can be determined prior to kernel launch, and if more than 65535 is required, I guess we’ll have to fall back to regular unsigned ints – at a substantial shared-mem cost, but I don’t think that can be avoided?).

However, I will need to reduce the magnitude of the unsigned shorts, so a version of atomicSubShort() is required.

Using Sylvain’s first example, I’m simply using atomicSub() in that construction:

__device__ unsigned short atomicSubShort(unsigned short* address, unsigned short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : val;
    unsigned int long_old = atomicSub(base_address, long_val);

    return ((size_t)address & 2) ? (unsigned short)(long_old >> 16) : (unsigned short)(long_old & 0xffff);
}

Is there any difference if I change the argument to a signed short val, and continue to use atomicAdd() with a negative val?

Now the compiler is complaining: External calls are not supported (found non-inlined call to _Z14atomicAddShortPts)

This happens on the line in __global__ code that calls the new atomic function.

What does it mean?

[EDIT]
Sorry, I fixed it. The function prototype I had did not match the modified type definitions.

Tera: Yes, it is just modulo arithmetic, so it works with signed numbers.

But if many negative numbers are expected, it will be more efficient to switch from the two’s-complement representation to a biased representation. That way, unsigned overflows won’t occur under normal circumstances.

In practice, this amounts to just flipping the sign bit:

__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);

    val = val ^ 0x8000;    // 2's complement to biased
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : (unsigned short)val;
    unsigned int long_old = atomicAdd(base_address, long_val);

    if ((size_t)address & 2) {
        return (short)((long_old >> 16) ^ 0x8000);    // biased to 2's complement
    } else {
        unsigned int overflow = ((long_old & 0xffff) + long_val) & 0xffff0000;
        if (overflow)
            atomicSub(base_address, overflow);        // subtract the carry (0x10000) back out
        return (short)((long_old & 0xffff) ^ 0x8000); // biased to 2's complement
    }
}

Yes, that should also work, as long as the value you are adding/subtracting to does not get negative (underflows).

(Did not know about atomicSub, actually…)

In fact, it works even in case of overflows and underflows, as long as the final result is in range. Spurious carries originating from overflow conditions will eventually be canceled out by borrows originating from underflows…

OK, now I’m completely lost with this ‘biased’ representation. More for the reading list.

But I quite like how the overflow-handling code you guys provided effectively ‘clamps’ the overflow to the lower bits and does not affect the higher bits. I don’t think overflow or underflow should occur in a properly designed simulation, but having the over/underflows handled this way is nice for post-processing validation.

With this in mind, what is the best way to implement a pair of atomicAddShort() and atomicSubShort() such that both overflows and underflows wrap around within the relevant half-word only (the memory data only needs to be unsigned)?

Thanks again.

The following test code (which imitates the memory access patterns in my actual simulation) works perfectly using tera’s longer example (which handles overflows and signed shorts).

__global__ void test_kernel(short* d_data)
{
    __shared__ short s_data[4];

    atomicAddShort(&d_data[0], 32767);
    atomicAddShort(&d_data[1], -32768);
    atomicAddShort(&d_data[2], 1);
    atomicAddShort(&d_data[3], -1);

    s_data[0] = d_data[0];
    s_data[1] = d_data[1];
    s_data[2] = d_data[2];
    s_data[3] = d_data[3];

    d_data[2] = s_data[2] - 1;
    d_data[3] = s_data[3] + 1;
    d_data[0] = s_data[0] + 1;
    d_data[1] = s_data[1] - 1;
}

// tera's signed version
__device__ short atomicAddShort(short* address, short val)
{
    unsigned int *base_address = (unsigned int *)((size_t)address & ~2);
    unsigned int long_val = ((size_t)address & 2) ? ((unsigned int)val << 16) : (unsigned short)val;
    unsigned int long_old = atomicAdd(base_address, long_val);

    if ((size_t)address & 2) {
        return (short)(long_old >> 16);
    } else {
        unsigned int overflow = ((long_old & 0xffff) + long_val) & 0xffff0000;
        if (overflow)
            atomicSub(base_address, overflow);
        return (short)(long_old & 0xffff);
    }
}

#include <cstdio>

int main(void)
{
    short h_data[4] = {0};
    short* d_data;

    cudaMalloc((void**)&d_data, 4 * sizeof(short));
    cudaMemcpy(d_data, h_data, 4 * sizeof(short), cudaMemcpyHostToDevice);

    test_kernel<<<1,1>>>(d_data);

    cudaMemcpy(h_data, d_data, 4 * sizeof(short), cudaMemcpyDeviceToHost);
    printf("%d\n%d\n%d\n%d\n", h_data[0], h_data[1], h_data[2], h_data[3]);

    return 0;
}

However, I am experiencing memory access violations in my simulation kernel after replacing the old 1D int arrays in global mem and shared mem with signed short, using the same type of memory accesses outlined above.

The only difference is that in the real code there is a __shared__ unsigned char s_chararray[512] followed directly by __shared__ short s_shortarray[512].

Also, the kernel parameter list contains a whole bunch of pointers to char arrays and short arrays in global memory. There was no problem when the short arrays were regular int arrays, both in global and shared mem.

Could the memory violations be due to some sort of alignment issue either in global or shared mem?

I have just discovered that changing the above code so that atomicAddShort() operates on shared mem rather than global mem, like this, does NOT work as expected:

__global__ void test_kernel(short* d_data)
{
    __shared__ short s_data[4];

    s_data[0] = d_data[0];    // all d_data is initially zero
    s_data[1] = d_data[1];
    s_data[2] = d_data[2];
    s_data[3] = d_data[3];

    atomicAddShort(&s_data[0], 32767);
    atomicAddShort(&s_data[1], -32768);
    atomicAddShort(&s_data[2], 1);
    atomicAddShort(&s_data[3], -1);

    d_data[2] = s_data[2] - 1;
    d_data[3] = s_data[3] + 1;
    d_data[0] = s_data[0] + 1;
    d_data[1] = s_data[1] - 1;
}

I have no idea why. Is this to do with alignment issues?

What compute capability is your device? Shared memory atomics require 1.2 or higher.

Does the example still work if the atomicAdd is performed in shared mem?

__global__ void test_kernel(short* d_data)
{
    __shared__ short s_data[4];

    s_data[0] = d_data[0];
    s_data[1] = d_data[1];
    s_data[2] = d_data[2];
    s_data[3] = d_data[3];

    atomicAddShort(&s_data[0], 32767);
    atomicAddShort(&s_data[1], -32768);
    atomicAddShort(&s_data[2], 1);
    atomicAddShort(&s_data[3], -1);

    d_data[2] = s_data[2] - 1;
    d_data[3] = s_data[3] + 1;
    d_data[0] = s_data[0] + 1;
    d_data[1] = s_data[1] - 1;
}

By the way, the example tests none of the “critical” properties of the atomicAdd() routine - atomicity or overflows.

I am testing this on a compute capability 1.3 Tesla C1060.

I have narrowed it down to the fact that global mem works (updates correctly, with overflow wraparound) but shared mem does not. The final +1 and -1 on s_data[0] and s_data[1] before storing back to d_data should cause positive and negative overflow, respectively. Both are fine when dealing with global mem, but both results become zero in shared mem.

[EDIT] In fact this returns zero everywhere:

__global__ void test_kernel(short* d_data)
{
    __shared__ short s_data[4];

    s_data[0] = d_data[0];
    s_data[1] = d_data[1];
    s_data[2] = d_data[2];
    s_data[3] = d_data[3];

    atomicAddShort(&s_data[0], 1);
    atomicAddShort(&s_data[1], -1);
    atomicAddShort(&s_data[2], 1);
    atomicAddShort(&s_data[3], -1);

    d_data[2] = s_data[2];
    d_data[3] = s_data[3];
    d_data[0] = s_data[0];
    d_data[1] = s_data[1];
}

I am assuming atomicity will be OK as it relies on the built-in atomicAdd() function? (Once this issue is resolved I can validate this against my earlier int-array results.)