Error [atomicCAS with unsigned short] in atomicMax with (__half*, __half)

Hi, we want to create an atomicMax with (__half*, __half) inputs. Here is our code:

__device__ static void atomicMax(__half* address, __half val) {
  unsigned short* address_as_ushort = reinterpret_cast<unsigned short*>(address);
  unsigned short old = *address_as_ushort, assumed;
  do {
    assumed = old;
    // Compute the max in fp32, then try to CAS the 16-bit result into place.
    __half hmax = __float2half(fmaxf(__half2float(val),
                                     __half2float(__ushort_as_half(assumed))));
    old = atomicCAS(address_as_ushort, assumed, __half_as_ushort(hmax));
  } while (assumed != old);
}

And the compile error is

error: no instance of overloaded function "atomicCAS" matches the argument list 
argument types are: (unsigned short *, unsigned short, unsigned short)

We followed the types listed for atomicCAS in the CUDA C++ Programming Guide (nvidia.com) and changed the inputs to unsigned int, but then got Error Code 1: Cuda Runtime (misaligned address).

So, how can we adapt atomicCAS? By the way, we cannot change the types passed to atomicCAS because of the __half_as_ushort() conversion.

Thanks : )

The atomicCAS support for unsigned short int requires that you compile for a cc7.0 target or higher, e.g. -arch=sm_70

CUDA requires accesses be “naturally” aligned, including atomics. That is, if you are accessing a 32 bit type, you must have a 32-bit aligned address. An address for an arbitrary 16-bit location is not necessarily 32-bit aligned. An attempt to access such a location using a 32-bit access anyway will result in a “misaligned address” error.

It is possible to do “larger” atomics on “smaller” types, but it requires paying attention to this rule (*). Here is an example atomicAdd for fp16 on a device that does not support 16-bit atomics. It should hopefully give you enough of a roadmap to make a 16-bit atomicMax. The basic idea is:

  • determine an appropriate 32-bit aligned address that covers the location you want a 16-bit atomic on
  • determine, based on that, whether you are working on the upper or lower half of the 32-bit location
  • assemble appropriate input for atomicCAS based on upper/lower half
  • leave the half that you are not modifying unchanged
  • perform atomicCAS on the 32-bit location, using the assembled quantities to target the upper or lower half.
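The steps above can be sketched in host-side C++ so the bit manipulation can be compiled and tested without a GPU. This is not the device code: std::atomic stands in for atomicCAS on an unsigned int, and a plain uint16_t max stands in for the __half conversion intrinsics (__half2float / fmaxf / __float2half). It assumes a little-endian target, which matches the GPU.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

// Sketch of a 16-bit atomic max implemented with a 32-bit CAS.
void atomic_max_u16(uint16_t* address, uint16_t val) {
    // 1. Find the 32-bit aligned word that contains *address.
    auto* base = reinterpret_cast<std::atomic<uint32_t>*>(
        reinterpret_cast<char*>(address) -
        (reinterpret_cast<uintptr_t>(address) & 2));
    // 2. Does the target live in the upper or lower 16 bits of that word?
    const bool upper = reinterpret_cast<uintptr_t>(address) & 2;
    uint32_t old = base->load();
    for (;;) {
        // 3. Extract the current 16-bit value and compute the max.
        const uint16_t cur = upper ? uint16_t(old >> 16)
                                   : uint16_t(old & 0xffff);
        const uint16_t m = std::max(cur, val);
        // 4. Reassemble the word, leaving the other half unchanged.
        const uint32_t desired = upper
            ? (old & 0x0000ffffu) | (uint32_t(m) << 16)
            : (old & 0xffff0000u) | m;
        // 5. CAS on the 32-bit word; on failure, `old` is refreshed
        //    with the current contents and we retry.
        if (base->compare_exchange_weak(old, desired)) break;
    }
}
```

On the device, step 3 would convert through fp32 with the __half intrinsics, and step 5 would be an atomicCAS on the unsigned int word, with the loop comparing the CAS return value against the assumed value.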

To be clear, if you are targeting devices of cc7.0 and above, just compile for the proper target. Then your atomicCAS usage should not throw a compile error. I don’t suggest using the larger/smaller method unless you need 16-bit atomic support on a device lower than cc7.0.

(*) another requirement that comes out of the larger/smaller approach is that the allocations for the areas you want to do atomics on must all be whole-number multiples of 32-bit quantities. For example, you could not allocate a single 16-bit global quantity, and then legally use this larger/smaller method. You must allocate a multiple of 32-bits, properly aligned to a 32-bit boundary.
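One simple way to satisfy that allocation rule, sketched as a hypothetical host-side helper: round the element count up to an even number of 16-bit values before allocating, so the buffer always spans whole 32-bit words (the allocator itself, e.g. cudaMalloc, already aligns the base address far beyond 32 bits).

```cpp
#include <cstddef>

// Round a count of 16-bit elements up to the next even number, so the
// allocation covers a whole number of 32-bit words. (Hypothetical
// helper; pass the padded count to your allocation call.)
size_t padded_half_count(size_t n) {
    return (n + 1) & ~static_cast<size_t>(1);
}
```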

Thanks for your advice. Our environment has 3 GPUs (--gpu-code=sm_86,sm_86,sm_61).
However, when we try to add

set(CMAKE_CUDA_ARCHITECTURES 86)

or

set(CUDA_ARCHITECTURE_FLAGS "86")

to our CMakeLists.txt, both commands fail, i.e., the atomicCAS call still does not compile. We tried to print __CUDA_ARCH__ in code but found no such ‘identifier’.

Are there some problems with how we are using these commands in CMake (version 3.20.6)?

Thanks again for your help.

Following the code in atomicAdd, we implemented

  // Work on the 32-bit aligned word that contains the 16-bit target.
  unsigned int *address_as_ui = (unsigned int *)((char *)address - ((size_t)address & 2));
  unsigned int old = *address_as_ui;
  unsigned int assumed;
  bool upper = (size_t)address & 2;  // does the target live in the upper 16 bits?
  do {
    assumed = old;
    // Extract the current 16-bit value, compute the max in fp32.
    __half hmax = __float2half(fmaxf(
        __half2float(__ushort_as_half(upper ? (old >> 16) : (old & 0xffff))),
        __half2float(val)));
    // Reassemble the word, leaving the other half unchanged.
    old = upper ? (old & 0xffff) | ((unsigned int)__half_as_ushort(hmax) << 16)
                : (old & 0xffff0000) | __half_as_ushort(hmax);
    old = atomicCAS(address_as_ui, assumed, old);
  } while (assumed != old);

but the result is still wrong

“result is still wrong”

is not a useful problem description for me.

And regarding CMake, I can’t help with that. It certainly appears from your cross-postings that your problems now have to do with CMake, not CUDA programming.

For the reference of others: the effective CMake command is set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_86,code=sm_86). As for the calculation, I will continue working on it. Thanks for your help.
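For later readers, a possible modern alternative (a sketch, assuming CMake ≥ 3.18 and a placeholder target name mytarget): CMAKE_CUDA_ARCHITECTURES set before the project() call builds real code for each listed generation, which covers both GPU models mentioned above.

```cmake
cmake_minimum_required(VERSION 3.18)

# Must be set before (or at) project() so it applies to the CUDA toolchain.
set(CMAKE_CUDA_ARCHITECTURES 61 86)

project(example LANGUAGES CXX CUDA)

add_executable(mytarget kernel.cu)
```

Note that when sm_61 is among the targets, the 16-bit atomicCAS overload is still unavailable in that compilation pass, so code using it would need a __CUDA_ARCH__ >= 700 guard with the larger/smaller fallback described earlier.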