atomicAnd function for an unsigned short value

Hallo cuda fans,

I need a logical AND function for an unsigned short value. I will clear the MSB and MSB - 1 Bit but the atomicAnd function is only for a 32Bit integer but I use an unsigned short value.

1 Like

No problem:

__device__ short atomicAndShort(short* address, unsigned short val)


    unsigned int *base_address = (unsigned int *) ((char *)address - ((size_t)address & 2));

    int shift = ((size_t)address & 2) << 3;

    unsigned int long_val = (unsigned int)val << shift) | (0xffff0000u >> shift);

return (atomicAnd(base_address, long_val) >> shift) & 0xffff;


Caveat: untested!

EDIT: Fix function name and bug spotted by sergeyn (see below)

is this alright that this code modifies bytes around *address ? I guess you at least need to compose a mask of 1s around the *address so that the data there does not change.

And I’m not sure if you should be expecting coherent result in case you later read that memory as shorts - you read/write dwords, and then read a part of that memory as short - is this guaranteed to work as expected ?

In general, my recommendation would be to work with the native types supported by instructions you need.

Oops! Thanks for pointing out the bug. I fixed it in the original posting.

Reading and writing the same memory with different word sizes should not be a problem as CUDA does not use different access paths based on word size. You are right of course that using the native types would be better as all the shift and mask operations would be unnecessary. Still if you are short on memory operations on smaller types might be needed.

It does at least use different instructions to access bytes,words,dwords afaik ? In this case it is up to the hardware to resolve the order ?


You are right that in principle the hardware could use this as hint in a kind of “hardware alias analysis” to allow reordering of accesses of different width. In practice however it is a commonly used optimization technique to use wider type memory accesses than the base types. E.g. memset() will usually use 32-bit or wider accesses even though the interface is based on bytes.

If this unsigned short is a single variable, it can be changed to uns int, if it’s an array of values, then atomicAnd is supposed to be executed in parallel, but on the last step the whole int will be blocked to execute atomicAnd, so half of threads will wait. So, does it worth at all to do atomicAnd for short?

If you know your threads are accessing strictly consecutive memory locations, you don’t need atomic operations at all…

It is already mentioned above that it is generally faster to use native data types for atomic operations, but sometimes these cannot be used.