CUDA __half atomicAdd Poor computing time

Hello!
I’m trying to use the __half atomicAdd as on the website. Is it normal that it is much slower than the float version?
My cuda version is 12.3 and the actual computing times are:

  • 6.7ms with floating atomicAdd
  • 83ms with __half atomicAdd

Thank you in advance!

It doesn’t appear to be the case according to my testing:

# cat t127.cu
#include <cuda_fp16.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int nTPB = 128;
const size_t nBLK = 1048576ULL;
const size_t ds = nBLK*nTPB;
#ifndef USE_FLOAT
using ft = half;
#else
using ft = float;
#endif
__global__ void k(const ft * __restrict__ i, ft * __restrict__ o){
        size_t idx = blockIdx.x*blockDim.x+threadIdx.x;
        atomicAdd(o, i[idx]);
}

int main(){

  ft *i, *o, *hi;
  cudaMalloc(&i, ds*sizeof(ft));
  cudaMalloc(&o,    sizeof(ft));
  hi = (ft *)malloc(ds*sizeof(ft));
  for (size_t i = 0; i < ds; i++)
#ifndef USE_FLOAT
    hi[i] = __float2half(-1.0f);
#else
    hi[i] = -1.0f;
#endif
  cudaMemcpy(i, hi, ds*sizeof(ft), cudaMemcpyHostToDevice);
  k<<<nBLK, nTPB>>>(i, o); // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  k<<<nBLK, nTPB>>>(i, o);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) std::cout << "Duration: " << dt/(float)USECPSEC << "s" << std::endl;
  else std::cout << "Error: " << cudaGetErrorString(err) << std::endl;
}

# nvcc -o t127 t127.cu -arch=sm_89
# ./t127
Duration: 0.085885s
# nvcc -o t127 t127.cu -arch=sm_89 -DUSE_FLOAT
# ./t127
Duration: 0.220041s
#

I do expect the float case to be about 2x slower for my test, as it is loading 2x the number of bytes.

Hello Robert, thank you for the answer!
Is it the same when you try to add a negative number? Because it is were I see a really big difference

There shouldn’t be any difference whether the numbers added are positive or negative. In my test case, if I initialize all input values to -1, there is no change in timing. (I’ve updated the above listing to demonstrate that.) I don’t think your claim is supportable.

If you believe it is, I suggest you provide a short, complete test case, just as I have done, to demonstrate your claim conclusively. Or, if you like, run my test case on your machine.