How can I use atomicSub for floats and doubles?

I would like to use atomicSub but for floats and doubles. From what I understand of the documentation, this is not yet supported and the built in atomicSub only works with integers.

In the documentation they show how one can go about implementing atomicAdd for doubles for older GPUs with atomicCAS(), therefore I thought I would be able to easily implement my own version of atomicSub for floats by modifying it like this:

__device__ double atomicSubFloat(float* address, float val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val -
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}

But it doesn’t seem to work. Presumably “__double_as_longlong” and ” __longlong_as_double” should be something else when working with floats, but I don’t know what to replace them with. Does anyone know how to use atomicSub for floats and doubles?

I"m sure this is “fixable” by using e.g. __float_as_int, but is it possible to use atomicAdd while changing the sign of the addend?

atomicAdd(address, -val);

?

Yes, that is in fact the exact workaround I’m using at the moment. But the kernel I’m doing that in is the largest bottleneck in my entire program according to the profiler. So I thought that it would be slightly more efficient to call atomicSub directly instead of adding the “unnecessary” multiplication of negative one.

Your proposal doesn’t call atomicSub directly (at least, not the built-in version of atomicSub). It is calling atomicCAS. And there is additional code involved, besides just the call to atomicCAS.

The mechanism you’re proposing can be made to work correctly, but it will be less efficient than calling atomicAdd with an additional multiplication.

That’s too bad. If you are sure that the custom atomicSub will be slower than built-in atomicAdd + the extra multiplication I will trust your judgement and keep using my current code. Hopefully nvidia will release a built-in atomicSub for floats/doubles in the near future, it would be handy to have available.

device floating point atomicSub operations could be realized like this:

__device__ double my_atomicSub(double* address, double val) {
 unsigned long long int* address_as_ull = (unsigned long long int*)address;
 unsigned long long int old = *address_as_ull, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(__longlong_as_double(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __longlong_as_double(old);
}

__device__ float my_atomicSub(float* address, float val) {
 int* address_as_int = (int*)address;
 int old = *address_as_int, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_int, assumed, __float_as_int(__int_as_float(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __int_as_float(old);
}

test case:

$ cat t278.cu
#include <stdio.h>

__device__ double my_atomicSub(double* address, double val) {
 unsigned long long int* address_as_ull = (unsigned long long int*)address;
 unsigned long long int old = *address_as_ull, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(__longlong_as_double(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __longlong_as_double(old);
}

__device__ float my_atomicSub(float* address, float val) {
 int* address_as_int = (int*)address;
 int old = *address_as_int, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_int, assumed, __float_as_int(__int_as_float(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __int_as_float(old);
}

template <typename T>
__global__ void k(T *addr, T val){

  my_atomicSub(addr, val);
}

int main(){

  float *d_f, h_f = 0.0f;
  double *d_d, h_d = 0.0;

  cudaMalloc(&d_f, sizeof(float));
  cudaMalloc(&d_d, sizeof(double));
  cudaMemset(d_f, 0, sizeof(float));
  cudaMemset(d_d, 0, sizeof(double));

  k<<<1,5>>>(d_f, 1.0f);
  k<<<1,6>>>(d_d, 1.0);
  cudaMemcpy(&h_f, d_f, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&h_d, d_d, sizeof(double), cudaMemcpyDeviceToHost);
  printf("float result: %f\n", h_f);
  printf("double result: %f\n", h_d);
  return 0;
}
$ nvcc -o t278 t278.cu
$ cuda-memcheck ./t278
========= CUDA-MEMCHECK
float result: -5.000000
double result: -6.000000
========= ERROR SUMMARY: 0 errors
$

Is it better or worse to use uint32_t and uint64_t instead of int and long long int? I see https://stackoverflow.com/questions/15514286/way-to-get-floating-point-special-values-in-cuda which also seem to be using long long int. Also are long and long long equivalent in terms of bits in cuda?

Avoid the use of long {int}. In CUDA, because it allows the sharing of code between host and device, the size of elementary types is necessarily identical between host and device. On a Windows host, sizeof (long int) == sizeof (int) == 4, but on a Linux host, sizeof (long int) == sizeof (long long int) == 8.

1 Like

Thank you @njuffa, so I also assume, in general, it’s better to use fixed width integers https://en.cppreference.com/w/cpp/types/integer than dealing with int/long/long long?

On all platforms currently supported by CUDA int64_t is long long int and int32_t is int. Analogous for the corresponding unsigned integer types. This has been stable for the past 12+ years, and while I do not foresee this changing, a more conservative-minded developer might want to use the specific-width types when re-interpreting float or double data.

A general rule of thumb for the use of integer types is: Every integer wants to be an int unless there is a strong reason for it to be some other type.

1 Like