Does CUDA support vectorized instructions for += and atomicAdd?

In a CUDA kernel, can we use float4 a += (float4) input[xxx]; or atomicAdd(&out, a);?

What should we write if we want to implement this functionality in a kernel?

For addition (since the order of the terms can be changed), atomically adding all four components of a float4 individually should give the same result you intend.

There may also be alternate solutions (depending on whether you need only the final result or also intermediate results in between):

  • write the results individually from each thread to different memory locations and then run a reduction kernel, which adds up the values in parallel (see the sketch after this list)
  • lock the other running threads during an update operation (works better if not too many writes occur)
  • try to compress your value, e.g. to 4x8 bits = char4 instead of float4
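
As a sketch of the first option: each thread writes its partial value to its own slot of an array, and a kernel like the following (a standard shared-memory tree reduction; the names reduceSum, partials and blockSums are illustrative, not from the original post) produces one partial sum per block, which can then be reduced again or summed on the host:

__global__ void reduceSum(const float *partials, float *blockSums, int n)
{
    extern __shared__ float sdata[]; // size passed at launch: blockDim.x * sizeof(float)
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? partials[i] : 0.0f; // pad out-of-range threads with 0
    __syncthreads();

    // tree reduction in shared memory; assumes blockDim.x is a power of two
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0]; // one partial sum per block
}

// launched e.g. as: reduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(partials, blockSums, n);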

No, it is not provided natively. If you want to do this, you could write it out:

a.x += ((float4 *)input)[xxx].x;
a.y += ((float4 *)input)[xxx].y;
a.z += ((float4 *)input)[xxx].z;
a.w += ((float4 *)input)[xxx].w;

or, probably better:

float4 temp = ((float4 *)input)[xxx];
a.x += temp.x;
a.y += temp.y;
a.z += temp.z;
a.w += temp.w;

or you could provide your own overload. How to do that is not unique or specific to CUDA C++; the concept is identical to how you would write an operator overload in C++.
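
One possible way to write such an overload (this is not part of CUDA itself; some CUDA sample helper headers provide something similar):

__host__ __device__ inline float4 &operator+=(float4 &a, const float4 &b)
{
    // component-wise add, usable from both host and device code
    a.x += b.x;
    a.y += b.y;
    a.z += b.z;
    a.w += b.w;
    return a;
}

With that overload in scope, a += ((float4 *)input)[xxx]; compiles as written in the question.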

There is recently introduced atomic support for certain vector types, including atomicAdd for float4. However, at the current time this support is limited to compute capability 9.x devices (Hopper).
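
A minimal sketch of using that support, assuming CUDA 12.x headers and a compute capability 9.x device (the kernel name accumulate is illustrative):

__global__ void accumulate(float4 *out, const float4 *input, int n)
{
#if __CUDA_ARCH__ >= 900
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, input[i]); // float4 overload; out must be naturally (16-byte) aligned, in global memory
#endif
}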

As already indicated, you can do atomics on the individual float components. In the case of atomicAdd, I'm not aware that this could lead to a different final result in the variable compared to a native float4 version. However, there is something to be aware of in the simultaneous-access case.
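
For example, a component-wise version (the helper name atomicAddFloat4 is illustrative) works on any device that supports float atomicAdd:

__device__ void atomicAddFloat4(float4 *addr, float4 v)
{
    // four independent float atomics; each component is atomic on its own,
    // but the float4 as a whole is not updated as a single atomic access
    atomicAdd(&addr->x, v.x);
    atomicAdd(&addr->y, v.y);
    atomicAdd(&addr->z, v.z);
    atomicAdd(&addr->w, v.w);
}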

If you intermix ordinary reads of the variable with atomic updates, then there is no guarantee that an intermediate read of the float4 quantity or its components will be coherent with an atomic update of the quantity, whether the update is done individually on the components or via the Hopper float4 atomic (this hazard is specifically called out in the documentation previously linked).

If we ignore that case and presume that only atomic updates are being performed, then in the case of atomicAdd I know of no way to explain any difference in final outcome between a real or imagined atomic on the whole float4 quantity and updating the components individually.

Also, as an aside: in CUDA, you should only do this sort of cast if input is a pointer properly, naturally aligned for float4 access. (I'm assuming that although you typed (float4)input, you actually meant (float4 *)input.)
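
To illustrate the alignment point (a sketch; the kernel name is made up): cudaMalloc returns pointers suitably aligned for any type (in practice at least 256 bytes), so a float4 view of the allocation base is fine, but an offset of a single float breaks the 16-byte alignment that float4 loads require:

__global__ void alignedLoadExample(const float *input, float4 *out)
{
    out[0] = ((const float4 *)input)[0];          // OK: base of a cudaMalloc allocation
    // out[1] = ((const float4 *)(input + 1))[0]; // misaligned access: undefined behavior
}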