How can I use atomicSub for floats and doubles?

pettes · August 21, 2018, 4:33pm

I would like to use atomicSub but for floats and doubles. From what I understand of the documentation, this is not yet supported and the built in atomicSub only works with integers.

In the documentation they show how one can go about implementing atomicAdd for doubles for older GPUs with atomicCAS(), therefore I thought I would be able to easily implement my own version of atomicSub for floats by modifying it like this:

__device__ double atomicSubFloat(float* address, float val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val -
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}

But it doesn’t seem to work. Presumably “__double_as_longlong” and ” __longlong_as_double” should be something else when working with floats, but I don’t know what to replace them with. Does anyone know how to use atomicSub for floats and doubles?

Robert_Crovella · August 21, 2018, 4:43pm

I"m sure this is “fixable” by using e.g. __float_as_int, but is it possible to use atomicAdd while changing the sign of the addend?

atomicAdd(address, -val);

?

pettes · August 21, 2018, 5:55pm

Yes, that is in fact the exact workaround I’m using at the moment. But the kernel I’m doing that in is the largest bottleneck in my entire program according to the profiler. So I thought that it would be slightly more efficient to call atomicSub directly instead of adding the “unnecessary” multiplication of negative one.

Robert_Crovella · August 21, 2018, 6:34pm

Your proposal doesn’t call atomicSub directly (at least, not the built-in version of atomicSub). It is calling atomicCAS. And there is additional code involved, besides just the call to atomicCAS.

The mechanism you’re proposing can be made to work correctly, but it will be less efficient than calling atomicAdd with an additional multiplication.

pettes · August 21, 2018, 7:15pm

That’s too bad. If you are sure that the custom atomicSub will be slower than built-in atomicAdd + the extra multiplication I will trust your judgement and keep using my current code. Hopefully nvidia will release a built-in atomicSub for floats/doubles in the near future, it would be handy to have available.

Robert_Crovella · August 21, 2018, 8:03pm

device floating point atomicSub operations could be realized like this:

__device__ double my_atomicSub(double* address, double val) {
 unsigned long long int* address_as_ull = (unsigned long long int*)address;
 unsigned long long int old = *address_as_ull, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(__longlong_as_double(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __longlong_as_double(old);
}

__device__ float my_atomicSub(float* address, float val) {
 int* address_as_int = (int*)address;
 int old = *address_as_int, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_int, assumed, __float_as_int(__int_as_float(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __int_as_float(old);
}

test case:

$ cat t278.cu
#include <stdio.h>

__device__ double my_atomicSub(double* address, double val) {
 unsigned long long int* address_as_ull = (unsigned long long int*)address;
 unsigned long long int old = *address_as_ull, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(__longlong_as_double(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __longlong_as_double(old);
}

__device__ float my_atomicSub(float* address, float val) {
 int* address_as_int = (int*)address;
 int old = *address_as_int, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_int, assumed, __float_as_int(__int_as_float(assumed) - val)); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __int_as_float(old);
}

template <typename T>
__global__ void k(T *addr, T val){

  my_atomicSub(addr, val);
}

int main(){

  float *d_f, h_f = 0.0f;
  double *d_d, h_d = 0.0;

  cudaMalloc(&d_f, sizeof(float));
  cudaMalloc(&d_d, sizeof(double));
  cudaMemset(d_f, 0, sizeof(float));
  cudaMemset(d_d, 0, sizeof(double));

  k<<<1,5>>>(d_f, 1.0f);
  k<<<1,6>>>(d_d, 1.0);
  cudaMemcpy(&h_f, d_f, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&h_d, d_d, sizeof(double), cudaMemcpyDeviceToHost);
  printf("float result: %f\n", h_f);
  printf("double result: %f\n", h_d);
  return 0;
}
$ nvcc -o t278 t278.cu
$ cuda-memcheck ./t278
========= CUDA-MEMCHECK
float result: -5.000000
double result: -6.000000
========= ERROR SUMMARY: 0 errors
$

v.miheer · October 13, 2020, 11:25pm

Is it better or worse to use uint32_t and uint64_t instead of int and long long int? I see Way to get floating-point special values in CUDA? - Stack Overflow which also seem to be using long long int. Also are long and long long equivalent in terms of bits in cuda?

njuffa · October 14, 2020, 12:06am

Avoid the use of long {int}. In CUDA, because it allows the sharing of code between host and device, the size of elementary types is necessarily identical between host and device. On a Windows host, sizeof (long int) == sizeof (int) == 4, but on a Linux host, sizeof (long int) == sizeof (long long int) == 8.

v.miheer · October 14, 2020, 12:15am

Thank you @njuffa, so I also assume, in general, it’s better to use fixed width integers Fixed width integer types (since C++11) - cppreference.com than dealing with int/long/long long?

njuffa · October 14, 2020, 12:22am

On all platforms currently supported by CUDA int64_t is long long int and int32_t is int. Analogous for the corresponding unsigned integer types. This has been stable for the past 12+ years, and while I do not foresee this changing, a more conservative-minded developer might want to use the specific-width types when re-interpreting float or double data.

A general rule of thumb for the use of integer types is: Every integer wants to be an int unless there is a strong reason for it to be some other type.

Topic		Replies	Views
Atomic Functions for double precision CUDA Programming and Performance	2	3534	June 8, 2009
Using atomic memory functions Legacy PGI Compilers	20	13337	October 25, 2012
atomicAdd(float,float) - atomicMul(float,float) ... CUDA Programming and Performance	13	57134	July 29, 2010
Atomic Add with Doubles OptiX	2	3663	July 17, 2017
Atomic add double in Fermi CUDA Programming and Performance	1	29331	March 20, 2011
help from the float atomic add How to release a float atomic add in the old device CUDA Programming and Performance	4	5754	April 17, 2010
Atomic Add 1.3 Float/Double Atomic Add CUDA Programming and Performance	4	2492	December 17, 2010
Try to implement a AtomicAddDouble function CUDA Programming and Performance	7	2577	January 31, 2011
how to do atomic add using double precision CUDA Programming and Performance	0	5301	October 6, 2011
Performance difference between atomicAdd() on 32 bit words(float, int, unsigned int). CUDA Programming and Performance	2	952	June 13, 2015

How can I use atomicSub for floats and doubles?

Related topics