atomicAdd with float2 no API support, workarounds ?

_constant · December 2, 2011, 10:51am

Hi,

I would like to do:

__global__ void function( float2* ptr )

{

float2 someValue = ... ;

.....

atomicAdd( &ptr[address] , someValue);

}

But atomicAdd only supports float and not float2… I suppose I could:

float* floatPtr = (float*)ptr;

but since “someValue” is in the register of a certain thread this won’t work:

atomicAdd( &floatPtr[address], someValue );

Does anyone have any suggestions? My current solution is to separate the real and imaginary ( working with complex numbers ) parts ( deinterleaved ).

cmaster.matso · December 2, 2011, 11:35am

You mean:

atomicAdd(&ptr[address].x, someValue.x);

atomicAdd(&ptr[address].y, someValue.y);

If so, is there a better way :)?

Regrads,

MK

P.S.

You could overload the ‘+’ operator, too (using the above code). But in such case I would create a new type ‘complex’ and implement the overload only for it, to prevent overloading the already existing vector addition overloads (present in CUDA math headers).

_constant · December 2, 2011, 12:04pm

Yes I am aware of that approach aswell but I didn’t think it would be very efficient since it looks like it would be quite uncoalesced?

Your doing a += operation ( global read and write ) on every other element ( first in x ) and the next you’re doing a read/write on every other element in y. So aren’t we potentially loosing half of our bandwith here? But i suppose the caches on Fermi would alleviate this quite a bit.

I havent compared the two but my feeling is that this should be more efficient:

atomicAdd(&ptr_real[address], someValue.x);

atomicAdd(&ptr_imag[address], someValue.y);

What do you think?

ikidntu · December 2, 2011, 1:27pm

I think it’s depending on how your address mapping to your thread. If consecutive thread access consecutive memory, it should be more coaleasing.
However in your case, because you use atomicAdd, I supposed that there are many threads that can access the same address so the coalesing will be more difficult to judge.

Not to mention that your data type is float = 32 bit, and as far as I remember, the recent architecture can fetch 32,64 or 128 bit from the global memory so caching and fetching data in the case of cmaster.matso will be faster.

_constant · December 2, 2011, 1:44pm

That’s exactly the problem, consecutive threads would access odd global memory elements in the first instruction and even in the other.

instruction1:

…

instruction2:

[y][y][y] …

Since addres is defined as:

adress = threadIdx.x + blockIdx.x * blockDim.x;

So if we were to use the approach suggested by cmaster.matso the read from instruction1 would be uncoalesced while on the upside the ‘y’ component would already be cached for the execution of instruction2. On the downside the ‘x’ write would be uncoalesced aswell as the write of ‘y’.

So to summarize:

read X - uncoalesced ( 50% ? )
write X - uncoalesced
read Y - coalseced ( already stored in L1 )
write Y - uncoalesced

Am i reasoning correctly here? Or does Fermi have an ace up its sleeve? :)

PS - The recent architecture can fetch as much as 256 bit

ikidntu · December 2, 2011, 1:57pm

I believe It would be more correct in your case to analyze += operation rather than atomicAdd, because if the address is accessed as you mentioned address = threadIdx.x + blockIdx.x * blockDim.x; then there is no way two threads can access the same address (am i right?) , so why need atomic ?

If it’s += then IMO, readX and readY are both totally coaloesing because it’s still within the range of fetching operation ( 256-bit maximum as you said ) and it’s totally aligned. However for writing it would be uncoalesed.

_constant · December 2, 2011, 2:31pm

I have blocks potentially overlapping each other doing a += operation in a very unpredictable manner. It’s very disgusting but atomicAdd is the only option here :)

Lev · December 2, 2011, 2:59pm

Only 64bit atomic can save you, pack float2 into 64 bit. If you have not 64bit atomics, only way to to fixed point. And somehow pack 2 floats into 32 bit integer.

seibert · December 2, 2011, 3:25pm

Does coalescing even matter for atomics? How many atomic operation units are there?

_constant · December 2, 2011, 3:34pm

That’s another possibility, i guess it will be supported directly in a future release. But for now I’m going to stick with the deinterleaved solution:

atomicAdd(&ptr_real[address], someValue.x);

atomicAdd(&ptr_imag[address], someValue.y);

It’s easy and it works, too bad some library API:s require interleaved complex numbers and don’t support both :)

_constant · December 2, 2011, 3:35pm

The atomicAdd is executed ~16*10^6 times.

seibert · December 5, 2011, 2:56pm

Sure, but I mean: Does the hardware actually run more efficiently if adjacent threads are atomically incrementing adjacent memory locations? I assumed there was only one unit on the GPU servicing atomic operations (as that would be the easiest way to avoid race conditions), so if several threads in a warp issue an atomic operation, they are automatically serialized. The other natural option is one atomic unit per memory controller (divide the bit width of the bus by 64 to get number of memory controllers), in which case it is possible that coalescing helps.

_constant · December 5, 2011, 3:04pm

This makes sense to me. I have not idea what is actually happening as this is something NV might be keeping under wraps. When I have time I will test if coalescing has any effect.

In some cases my global read/writes won’t need the atomics but I can’t be sure and I’m guessing there is no runtime optimizations that check for this either.

Lev · December 5, 2011, 6:33pm

atomicAdd(&ptr_real[address], someValue.x);
atomicAdd(&ptr_imag[address], someValue.y);

with such solution you may end up with mixed real and image part, if two different threads will write at once,

tera · December 5, 2011, 6:45pm

Keep in mind that coalesced memory accesses are executed within one memory controller (to take advantage of SDRAM’s burst mode). So even with one atomic unit per memory controller a coalesced access would still need to be serialized at least within the controller.

_constant · December 5, 2011, 6:59pm

No I don’t think so, the pointers are pointing to two separate memory spaces and someValue is in the registers of one thread. Am I missing something here?

Lev · December 5, 2011, 7:06pm

I am sorry, probably not, sum is commutative. And my comment about packing into int64 was wrong. I somehow thought about other atomic operation and possible usage of previouse value of counter. I.s. a=atomicadd b=atomicadd that way real and image parts could be from different numbers.

_constant · December 5, 2011, 7:10pm

sorry you were wrong or sorry I’m mistaken? :-)

And yes I expect two different threads to write at once as you described.

seibert · December 5, 2011, 8:36pm

Yeah, if each controller handles 64 bits, then I would imagine that a perfectly “coalesced” atomic instruction on a 32-bit element would have a 2-way conflict at a minimum.

tera · December 5, 2011, 11:19pm

I don’t think coalesced accesses are split between multiple memory controllers. For one thing, it would be difficult with the configurable number of memory controllers that are not necessarily powers of two (some even prime numbers). For another thing, it would be wasteful to have multiple memory channels transmit the address. Sending the address on one channel and then using burst mode to pipeline in following doublewords seems both more efficient and simpler to implement.

Topic		Replies	Views
AtomicAdd algorithm CUDA Programming and Performance	7	3721	August 25, 2009
atomicAdd(float,float) - atomicMul(float,float) ... CUDA Programming and Performance	13	56398	July 29, 2010
Half2 atomics generate unused code CUDA Programming and Performance	13	137	August 8, 2024
Why does the atomicAdd work faster than â€˜+=â€™? CUDA Programming and Performance	3	12185	November 2, 2011
Is there a way to avoid atomicAdd in my situation? CUDA Programming and Performance	3	1386	March 4, 2019
Atomic float operations. especially add CUDA Programming and Performance	10	16377	July 31, 2009
Threads and Race Condition CUDA Programming and Performance	11	2969	April 30, 2012
multiple threads writing value to a same variable CUDA Programming and Performance	19	6660	March 20, 2012
Strategies for improving Atomic operation peformance CUDA Programming and Performance	5	916	June 18, 2014
Using atomicAdd to step through an array CUDA Programming and Performance	7	3885	May 24, 2011

atomicAdd with float2 no API support, workarounds ?

Related topics