AtomicAdd not overloaded for c10::Half

DoubleFloater · February 25, 2022, 9:05am

In my PyTorch CUDA extension, I require an atomicAdd. But I get the following error:

error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::Half *, c10::Half)

The kernel is dispatched using PyTorch’s AT_DISPATCH_FLOATING_TYPES_AND_HALF macro; it only compiles when I change it to AT_DISPATCH_FLOATING_TYPES, indicating that it works with float and double, but not the half datatype.

My code looks as follows:

template <typename scalar_t>
__global__ void f(torch::PackedTensorAccessor32<scalar_t, 2, torch::RestrictPtrTraits> x, ...) {
    // ...
    atomicAdd(&x[i][j], y);
    // ...
}

I tried:

- Adding #include <cuda_fp16.h>
- Using the flags extra_cuda_cflags=['-gencode=arch=compute_61,code=sm_61'], as described in "Why does atomicAdd not work with doubles as input?"
- Changing atomicAdd to gpuAtomicAdd, as described in "C10::Half float type support for atomicAdd?"

System info:

Driver Version: 510.47.03
CUDA Version: 11.6
Model: Nvidia GeForce GTX 1070

It would be nice if someone could help.

striker159 · February 25, 2022, 3:20pm

The following overloads for atomicAdd exist (see the Programming Guide in the CUDA Toolkit Documentation):

int atomicAdd(int* address, int val);
unsigned int atomicAdd(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicAdd(unsigned long long int* address,
                                 unsigned long long int val);
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
__half2 atomicAdd(__half2 *address, __half2 val);
__half atomicAdd(__half *address, __half val);
__nv_bfloat162 atomicAdd(__nv_bfloat162 *address, __nv_bfloat162 val);
__nv_bfloat16 atomicAdd(__nv_bfloat16 *address, __nv_bfloat16 val);

If c10::Half is compatible with one of those types, you can just use typecasts. Otherwise, you need to implement your own compare-and-swap-based atomicAdd.
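A minimal sketch of the typecast route, assuming c10::Half and __half share the same 16-bit layout so that the pointer can be reinterpreted. The helper name atomic_add_half_native is made up for illustration, and the call only compiles in where the native __half overload exists (compute capability 7.x and higher):

```cuda
#include <cuda_fp16.h>
#include <c10/util/Half.h>

// Hypothetical helper, not part of PyTorch or CUDA.
// Assumption: c10::Half and __half are both plain 16-bit IEEE half values,
// so a c10::Half* can be reinterpreted as a __half*.
__device__ __forceinline__ void atomic_add_half_native(c10::Half* address, c10::Half val) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
    // val is converted implicitly through c10::Half's conversion to __half.
    atomicAdd(reinterpret_cast<__half*>(address), val);
#else
    // No native __half atomicAdd below sm_70: this branch adds nothing,
    // so a compare-and-swap loop has to be used there instead.
    (void)address;
    (void)val;
#endif
}
```

Inside a kernel dispatched with AT_DISPATCH_FLOATING_TYPES_AND_HALF, this would be called in place of atomicAdd when scalar_t is c10::Half.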

DoubleFloater · February 27, 2022, 9:30pm

Thank you for the answer! That link was helpful. It says:

The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.

I had already tried overloading the function and then casting to __half in the case where scalar_t == c10::Half. But it turns out that my GeForce GTX 1070 does not support __half atomicAdd(__half *address, __half val), because it has compute capability 6.1.
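To check which case applies on a given machine, the compute capability can be queried at runtime. A minimal sketch; the helper name compute_capability is made up here:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns major * 10 + minor for the given device
// (for example 61 for compute capability 6.1), or -1 if the query fails,
// e.g. when no CUDA device is present.
int compute_capability(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.major * 10 + prop.minor;
}
```

A result below 70 means the native __half atomicAdd is not available on that device.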

Do you know of a way to work around the issue? How do I implement my own atomicAdd for __half values?

striker159 · February 28, 2022, 8:25am

It is explained in the linked documentation, right above the atomicAdd overloads.
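For reference, the compare-and-swap pattern shown there for double looks roughly like the sketch below. The function is renamed to atomic_add_double_cas (a made-up name) so it does not clash with the built-in overload, and the small kernel and host helper are only there to exercise it:

```cuda
#include <cuda_runtime.h>

// CAS-based atomic add for double, following the pattern in the
// CUDA C++ Programming Guide. The name is hypothetical.
__device__ double atomic_add_double_cas(double* address, double val) {
    unsigned long long int* address_as_ull =
        reinterpret_cast<unsigned long long int*>(address);
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        // Try to replace the value last seen with (that value + val).
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Integer comparison avoids an endless loop on NaN (NaN != NaN).
    } while (assumed != old);
    return __longlong_as_double(old);  // value stored before this add
}

// Every thread adds 1.0 to the same accumulator.
__global__ void add_ones_double(double* acc) {
    atomic_add_double_cas(acc, 1.0);
}

// Host helper: runs blocks * threads additions and returns the total.
double run_double_demo(int blocks, int threads) {
    double h_acc = 0.0;
    double* d_acc = nullptr;
    cudaMalloc(&d_acc, sizeof(double));
    cudaMemcpy(d_acc, &h_acc, sizeof(double), cudaMemcpyHostToDevice);
    add_ones_double<<<blocks, threads>>>(d_acc);
    cudaMemcpy(&h_acc, d_acc, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_acc);
    return h_acc;
}
```

The same idea carries over to 16-bit values, except that atomicCAS has to operate on the aligned 32-bit word containing the half.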

DoubleFloater · March 5, 2022, 7:31pm

Thank you! It took me a while, but I eventually figured out the following code (tested and working for my purpose):

#ifdef __CUDA_ARCH__
#if __CUDA_ARCH__ < 700
// adapted from https://github.com/torch/cutorch/blob/master/lib/THC/THCAtomics.cuh
__device__ __forceinline__ void atomicAdd(c10::Half* address, c10::Half val) {
    // Locate the aligned 32-bit word that contains the 16-bit target value.
    unsigned int *address_as_ui = reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - (reinterpret_cast<size_t>(address) & 2));
    unsigned int old = *address_as_ui;
    unsigned int assumed;

    do {
        assumed = old;
        // Load the current 16 bits into a c10::Half (x holds the raw bits) so that
        // the addition is done on half values, not on the raw bit pattern.
        c10::Half hsum;
        hsum.x = reinterpret_cast<size_t>(address) & 2 ? (old >> 16) : (old & 0xffff);
        hsum = hsum + val;
        old = reinterpret_cast<size_t>(address) & 2
                 ? (old & 0xffff) | (hsum.x << 16)
                 : (old & 0xffff0000) | hsum.x;
        old = atomicCAS(address_as_ui, assumed, old);

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
}
#endif
#endif
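A small self-check of the same technique may be useful when adapting it. The sketch below uses raw __half instead of c10::Half so that it needs no PyTorch headers; all names (atomic_add_half_cas, add_ones_half, run_half_demo) are made up, and the addition is done in float so that no half arithmetic intrinsics are required:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// CAS-based add on a 16-bit half stored inside an aligned 32-bit word.
__device__ void atomic_add_half_cas(__half* address, __half val) {
    size_t addr = reinterpret_cast<size_t>(address);
    bool upper = (addr & 2) != 0;  // true if the value sits in the upper 16 bits
    unsigned int* word = reinterpret_cast<unsigned int*>(
        reinterpret_cast<char*>(address) - (addr & 2));
    unsigned int old = *word;
    unsigned int assumed;
    do {
        assumed = old;
        __half_raw raw;
        raw.x = static_cast<unsigned short>(upper ? (old >> 16) : (old & 0xffffu));
        // Convert the stored bits to a number, add, and convert back to bits.
        float sum = __half2float(__half(raw)) + __half2float(val);
        __half_raw res = __float2half(sum);
        unsigned int bits = res.x;
        unsigned int desired = upper ? ((old & 0xffffu) | (bits << 16))
                                     : ((old & 0xffff0000u) | bits);
        old = atomicCAS(word, assumed, desired);
    } while (assumed != old);  // integer comparison, safe for NaN
}

// Each thread adds 1.0 to element (thread index % 2), so both halves
// of the same 32-bit word are updated concurrently.
__global__ void add_ones_half(__half* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    atomic_add_half_cas(&data[tid % 2], __float2half(1.0f));
}

// Host helper: returns data[index] after blocks * threads threads have run.
float run_half_demo(int blocks, int threads, int index) {
    __half h_data[2] = {__float2half(0.0f), __float2half(0.0f)};
    __half* d_data = nullptr;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    add_ones_half<<<blocks, threads>>>(d_data);
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return __half2float(h_data[index]);
}
```

With 4 blocks of 128 threads, each of the two elements receives 256 additions of 1.0. Since half precision represents integers of that size exactly, both elements should come back as 256 if no update was lost.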
