Atomic Functions Performance

Does anyone have any solid information on performance of atomic functions? I’ve looked all around, and the only thing I can find is ‘it’s slow’. Just how slow is that? In particular, how do atomic functions perform when trying to write to contested versus non-contested memory locations?

Basically, I want to know if they are suitable for building a spin lock, in this case, for scattering points into a large texture. The texture is expected to be on the order of 1kx1k elements, with writes done more or less randomly, so the lock would only block occasionally.

The code would look like this:

while (atomicExch(&lock, 1) != 0)
  ;           // spin until we acquire the lock
// ... write stuff ...
lock = 0;     // unlock

lock is an integer stored per element, so this is a fine-grained lock.

Even ignoring contention entirely, atomic ops are slow when the accesses are uncoalesced: about as slow as ordinary uncoalesced reads/writes. Since atomic ops are usually applied to more or less random addresses, that tends to always be the case, hence the over-generalization "slow".

As far as contention goes: what would be a small amount of contention in a 4-thread CPU algorithm can bring the 10,000+ concurrently running threads on a GPU to a screeching halt. How much contention is too much? I have no idea. It doesn't sound like your app has much, but only benchmarks will tell.

A good benchmark for contentions is the CUDA SDK’s Histogram256 project.

If the input data is random, output writes are distributed over the output vector and performance is OK.

But if you modify the demo so that all input values are the same, then processing time explodes. It’s crashed my computer on several occasions :argh:

The demo tests the different atomics capabilities (1.0, 1.1, 1.2) so it might be a good starting point for evaluation.
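To see why uniform input is the worst case, here is a minimal sketch in the same spirit (this is not the SDK code, and it assumes shared-memory atomics, i.e. compute capability 1.2+): when every input byte is identical, every atomicAdd in a warp targets the same bin and the hardware has to serialize them.

__global__ void hist256( const unsigned char *in, unsigned int *out, int n ) {
  __shared__ unsigned int bins[256];
  int tid = threadIdx.x;
  // zero the per-block histogram
  for( int b = tid; b < 256; b += blockDim.x )
    bins[b] = 0;
  __syncthreads();
  // uniform input => all threads in a warp hit the same bin and serialize
  for( int i = blockIdx.x*blockDim.x + tid; i < n; i += gridDim.x*blockDim.x )
    atomicAdd( &bins[in[i]], 1 );
  __syncthreads();
  // merge the per-block histogram into the global one
  for( int b = tid; b < 256; b += blockDim.x )
    atomicAdd( &out[b], bins[b] );
}

With random input the 256 bins spread the traffic out; with constant input the inner atomicAdd degenerates into one long serial chain, which is the blow-up described above.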

I ran a test on this a couple of days ago, for an uncontested access. Comparing

__global__ void SimpleAdd( int *A, const int n ) {
  int i = threadIdx.x + (blockDim.x*blockIdx.x);
  if( i < n ) {
    A[i] = A[i] + 1;
  }
}

and

__global__ void AddAtomic( int *A, const int n ) {
  int i = threadIdx.x + (blockDim.x*blockIdx.x);
  if( i < n ) {
    atomicAdd( &(A[i]), 1 );
  }
}

I found that for 8M elements the first kernel ran in 1.2 ms and the second in 6.15 ms (averaged over 128 launches), so the atomic version looks to be at least 4x slower. This is on a GeForce 9800 GX2.
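A harness along these lines, using CUDA events, would reproduce that measurement (grid/block sizes are illustrative and error checking is omitted for brevity):

#include <cstdio>

int main() {
  const int n = 8 * 1024 * 1024;   // 8M elements
  const int reps = 128;            // average over 128 launches
  int *dA;
  cudaMalloc( &dA, n * sizeof(int) );
  cudaMemset( dA, 0, n * sizeof(int) );

  dim3 block( 256 );
  dim3 grid( (n + block.x - 1) / block.x );

  cudaEvent_t start, stop;
  cudaEventCreate( &start );
  cudaEventCreate( &stop );

  cudaEventRecord( start );
  for( int r = 0; r < reps; ++r )
    AddAtomic<<<grid, block>>>( dA, n );   // or SimpleAdd for the baseline
  cudaEventRecord( stop );
  cudaEventSynchronize( stop );

  float ms = 0.0f;
  cudaEventElapsedTime( &ms, start, stop );
  printf( "avg per launch: %.3f ms\n", ms / reps );

  cudaFree( dA );
  return 0;
}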

4x slower than coalesced accesses: that's an important distinction. Most of the time you'd actually want an atomic op, I think, the accesses would be completely random.

Also, don't try to implement a lock. I know it's tempting and it would make things simple, but you'll either deadlock the card (the classic failure: when two threads in the same warp contend, the SIMT execution model can leave the loser spinning forever while the winner waits for the warp to re-converge) or tie the implementation so tightly to the current card that you'll get zero perf benefit from future cards unless you rearrange your grids/blocks. (Maybe you could do that with templates? Regardless, you probably can't keep enough threads in flight to hide memory latency if you build a lock.)
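If you need a read-modify-write that the built-in atomics don't cover, a lock-free compare-and-swap retry loop avoids holding a lock entirely. A sketch for float accumulation, which is a common workaround on hardware of this generation rather than anything specific to your code:

// Lock-free float add via atomicCAS on the raw bits: instead of taking a
// lock around the update, retry the whole read-modify-write until it sticks.
__device__ void atomicAddFloat( float *addr, float val ) {
  int *iaddr = (int *)addr;
  int old = *iaddr, assumed;
  do {
    assumed = old;
    old = atomicCAS( iaddr, assumed,
                     __float_as_int( val + __int_as_float(assumed) ) );
  } while( assumed != old );   // someone else won the race; try again
}

Each thread makes forward progress independently, so there is nothing to deadlock on, and contention just costs extra retries rather than a hung warp.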

Fair point… although for the application I was looking at (see my thread on gridding) I was having trouble with bunches of coalesceable operations.

Thanks for the responses!

I can't think of any realistic way to coalesce the point outputs. In fact, on each iteration I will be swizzling the points to ensure they end up in different warps (it's an IFS algorithm; if every point in a group took the same sequence of iteration functions, they'd all degenerate toward a single point). However, the memory scattering should hopefully be hidden by the sizable amount of computation involved in each iteration.