atomicAdd with float2: no API support, workarounds?


I would like to do:

__global__ void function( float2* ptr )
{
    float2 someValue = ... ;

    atomicAdd( &ptr[address], someValue );
}


But atomicAdd only supports float and not float2… I suppose I could:

float* floatPtr = (float*)ptr;

but “someValue” is a float2 held in a thread’s registers, so this still won’t work:

atomicAdd( &floatPtr[address], someValue );

Does anyone have any suggestions? My current solution is to separate the real and imaginary parts (I’m working with complex numbers), i.e. store them deinterleaved.

You mean:

atomicAdd(&ptr[address].x, someValue.x);

atomicAdd(&ptr[address].y, someValue.y);

If so, is there a better way :)?




You could overload the ‘+’ operator, too (using the above code). But in that case I would create a new type ‘complex’ and implement the overload only for it, to avoid clashing with the vector-addition overloads already present in the CUDA math headers.
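A rough sketch of that suggestion (untested, and the names are mine): a dedicated ‘complex’ struct with its own ‘+’ overload, plus a helper that does the component-wise atomic accumulation from the posts above. Because the overload is defined only for this type, the float2 vector-math overloads in the CUDA headers are left untouched.

```cuda
// Sketch (hypothetical names): a dedicated complex type so operator+
// does not collide with CUDA's float2 vector-math overloads.
struct complex {
    float re, im;
};

__host__ __device__ inline complex operator+(complex a, complex b)
{
    complex r;
    r.re = a.re + b.re;
    r.im = a.im + b.im;
    return r;
}

// Component-wise atomic accumulation, as discussed above.
// Two separate 32-bit atomics; the pair is NOT updated atomically as a unit.
__device__ inline void atomicAddComplex(complex* p, complex v)
{
    atomicAdd(&p->re, v.re);
    atomicAdd(&p->im, v.im);
}
```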

Yes, I am aware of that approach as well, but I didn’t think it would be very efficient, since it looks like the accesses would be quite uncoalesced.

You’re doing a += operation (a global read and write) on every other element (first on x), and in the next instruction you’re doing a read/write on every other element (y). So aren’t we potentially losing half of our bandwidth here? But I suppose the caches on Fermi would alleviate this quite a bit.

I haven’t compared the two, but my feeling is that this should be more efficient:

atomicAdd(&ptr_real[address], someValue.x);

atomicAdd(&ptr_imag[address], someValue.y);

What do you think?
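For context, the deinterleaved variant I have in mind looks roughly like this (a sketch with hypothetical names, assuming ptr_real and ptr_imag are two separately allocated float arrays): consecutive threads then hit consecutive floats in each array.

```cuda
// Sketch (hypothetical names): SoA layout, real and imaginary parts in
// two separate float arrays instead of one interleaved float2 array.
__global__ void accumulate(float* ptr_real, float* ptr_imag,
                           const float2* contrib)
{
    int address = threadIdx.x + blockIdx.x * blockDim.x;
    float2 someValue = contrib[address];

    // Consecutive threads touch consecutive addresses in each array.
    atomicAdd(&ptr_real[address], someValue.x);
    atomicAdd(&ptr_imag[address], someValue.y);
}
```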

I think it depends on how your addresses map to your threads. If consecutive threads access consecutive memory, it should coalesce better.
However, in your case, because you use atomicAdd, I suppose many threads can access the same address, so the coalescing is harder to judge.

Not to mention that your data type is float = 32 bit, and as far as I remember, recent architectures can fetch 32, 64, or 128 bits per thread from global memory, so caching and fetching data in cmaster.matso’s case will be faster.
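To illustrate the point about fetch widths (a sketch, untested): reading interleaved float2 data compiles to a single 64-bit load per thread, so the x and y components of each element arrive together in one transaction.

```cuda
// Sketch: a float2 read is one 64-bit global load per thread,
// so both components of an element are fetched together.
__global__ void touch(const float2* in, float* out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float2 v = in[i];      // single 64-bit load
    out[i] = v.x + v.y;
}
```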

That’s exactly the problem: consecutive threads would access the odd global memory elements in the first instruction and the even ones in the second.



[x][y][x][y][x][y] …

Since address is defined as:

address = threadIdx.x + blockIdx.x * blockDim.x;

So if we were to use the approach suggested by cmaster.matso, the read in instruction 1 would be uncoalesced, while on the upside the ‘y’ component would already be cached for the execution of instruction 2. On the downside, the ‘x’ write would be uncoalesced, as well as the write of ‘y’.

So to summarize:

read X - uncoalesced ( 50% ? )

write X - uncoalesced

read Y - coalesced ( already stored in L1 )

write Y - uncoalesced

Am I reasoning correctly here? Or does Fermi have an ace up its sleeve? :)

PS - The recent architecture can fetch as much as 256 bit

I believe it would be more correct in your case to analyze the += operation rather than atomicAdd, because if the address is computed as you mentioned (address = threadIdx.x + blockIdx.x * blockDim.x;), then there is no way two threads can access the same address (am I right?), so why the need for atomics?

If it’s +=, then IMO readX and readY are both fully coalesced, because they’re still within the range of one fetch operation (256-bit maximum, as you said) and the accesses are aligned. The writes, however, would be uncoalesced.

I have blocks potentially overlapping each other, doing += operations in a very unpredictable manner. It’s very disgusting, but atomicAdd is the only option here :)

Only a 64-bit atomic can save you: pack the float2 into 64 bits. If you don’t have 64-bit atomics, the only way is fixed point, somehow packing the two floats into one 32-bit integer.
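A sketch of the 64-bit route (untested, names are mine): there is no 64-bit float atomicAdd, but you can spin on the 64-bit atomicCAS, which requires compute capability 1.2+ and an 8-byte-aligned pointer (cudaMalloc’d float2 arrays satisfy this). The whole float2 is then updated as one atomic unit.

```cuda
// Sketch (untested): emulate atomicAdd on a float2 by packing it into a
// 64-bit word and looping on atomicCAS until no other thread interfered.
__device__ float2 atomicAddFloat2(float2* addr, float2 val)
{
    unsigned long long int* p = (unsigned long long int*)addr;
    union { unsigned long long int u; float2 f; } old, assumed, sum;

    old.u = *p;
    do {
        assumed = old;                          // value we based the add on
        sum.f.x = assumed.f.x + val.x;
        sum.f.y = assumed.f.y + val.y;
        old.u = atomicCAS(p, assumed.u, sum.u); // succeeds only if unchanged
    } while (old.u != assumed.u);

    return assumed.f;  // value before the add, like the built-in atomics
}
```

Under heavy contention the CAS loop retries a lot, so the two-separate-atomicAdd approach may well be faster in practice; the CAS version buys you atomicity of the pair, not speed.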

Does coalescing even matter for atomics? How many atomic operation units are there?

That’s another possibility; I guess it will be supported directly in a future release. But for now I’m going to stick with the deinterleaved solution:

atomicAdd(&ptr_real[address], someValue.x);

atomicAdd(&ptr_imag[address], someValue.y);

It’s easy and it works. Too bad some library APIs require interleaved complex numbers and don’t support both :)

The atomicAdd is executed ~16*10^6 times.

Sure, but I mean: does the hardware actually run more efficiently if adjacent threads are atomically incrementing adjacent memory locations? I assumed there was only one unit on the GPU servicing atomic operations (as that would be the easiest way to avoid race conditions), so if several threads in a warp issue an atomic operation, they are automatically serialized. The other natural option is one atomic unit per memory controller (divide the bit width of the bus by 64 to get the number of memory controllers), in which case it is possible that coalescing helps.

This makes sense to me. I have no idea what is actually happening, as this is something NV might be keeping under wraps. When I have time I will test whether coalescing has any effect.

In some cases my global reads/writes won’t need the atomics, but I can’t be sure, and I’m guessing there are no runtime optimizations that check for this either.
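A micro-benchmark to test this could look roughly like the following sketch (untested, names are mine): one kernel where adjacent threads hit adjacent addresses, one where every thread hits the same address, timed with CUDA events.

```cuda
// Sketch (untested): compare "coalesced" atomics against fully
// conflicting atomics to see whether address spread matters.
__global__ void atomic_spread(float* out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    atomicAdd(&out[i], 1.0f);   // adjacent threads, adjacent addresses
}

__global__ void atomic_clash(float* out)
{
    atomicAdd(&out[0], 1.0f);   // every thread hits the same address
}

// Host side: bracket many launches of each kernel with cudaEventRecord()
// and compare the results of cudaEventElapsedTime().
```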

atomicAdd(&ptr_real[address], someValue.x);
atomicAdd(&ptr_imag[address], someValue.y);

With such a solution you may end up with mixed real and imaginary parts if two different threads write at once.

Keep in mind that coalesced memory accesses are executed within one memory controller (to take advantage of SDRAM’s burst mode). So even with one atomic unit per memory controller a coalesced access would still need to be serialized at least within the controller.

No, I don’t think so: the pointers point to two separate memory spaces, and someValue is in the registers of one thread. Am I missing something here?

I am sorry, probably not: the sum is commutative. And my comment about packing into int64 was wrong. I somehow thought about other atomic operations and possible usage of the previous value of the counter, i.e. a = atomicAdd(...); b = atomicAdd(...); — that way the real and imaginary parts could be from different numbers.

sorry you were wrong or sorry I’m mistaken? :-)

And yes I expect two different threads to write at once as you described.

Yeah, if each controller handles 64 bits, then I would imagine that a perfectly “coalesced” atomic instruction on a 32-bit element would have a 2-way conflict at a minimum.

I don’t think coalesced accesses are split between multiple memory controllers. For one thing, it would be difficult with a configurable number of memory controllers that is not necessarily a power of two (some counts are even prime numbers). For another, it would be wasteful to have multiple memory channels transmit the address. Sending the address on one channel and then using burst mode to pipeline in the following doublewords seems both more efficient and simpler to implement.