Is atomicExch() safe for incrementing a global float array?

I’m using a kernel that is essentially de-multiplexing data, so many blocks need to access the same float array in global memory. I’m not aware of versions of atomicAdd(), atomicInc(), etc. for floats; all we have is atomicExch(). The atomicExch() call wouldn’t work when I passed the actual array element directly, but it did work when I dereferenced a separate pointer to the same address:

float inc;
float* tmp;

tmp = &someArray[ndx];
atomicExch( &someArray[ndx], ( *tmp + inc ) );

This seems to be working fine, but is it really safe? Anyone been using atomicExch() in this fashion?

  • Richard

That isn’t an atomic add. You have no guarantee that the value won’t change in the many-hundred-clock-cycle window between your load and your atomic store.

You can do atomic FP adds using exchanges. There’s a thread here from about 6 months ago on it.
You swap in 0.0, add your FP addend, swap that result back, and if you don’t get 0.0 on the second swap, keep repeating swaps until you do get 0.0 back. It’s ugly, but it works.
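
Roughly, the loop would look something like this (untested sketch; the helper name is mine, not from that older thread):

__device__ void atomicFloatAddViaExch( float* address, float val )
{
	float carry = val;
	do {
		// Swap 0.0f in, add whatever we picked up to our carry, then try to
		// swap the sum back.  If the second exchange returns something other
		// than 0.0f, another thread deposited a partial sum between the two
		// exchanges; keep it as the new carry and try again.
		carry = atomicExch( address, atomicExch( address, 0.0f ) + carry );
	} while (carry != 0.0f);
}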

Note that FP adds are order dependent due to rounding, though. Often the difference is trivial, but it always seems to bite people when results differ across GPUs.

Well, that depends on how the developer chose to implement atomicExch(). When I put the array element itself in as the operand to increment, nothing at all seems to happen. This may be due to some form of locking on the variable, which is the first thing I would do if I implemented this method, and it would explain why atomicExch( &array[ndx], ( array[ndx] + 1.0f ) ) doesn’t work. But if the value of the variable is made unavailable while the memory address is actually locked only for writing, that would explain why simply going through another pointer to the same address, as in atomicExch( &array[ndx], ( *tmp + 1.0f ) ), makes it work. Understand that the kernel I’m using this in launches 720 8x8x4 blocks, which comes out to a reasonably large number of threads, and so far it looks good. There would be no reason to lock the memory for reading until the atomicExch() completes all its operations, so each thread would only be dealing with its own representation of that memory location.

Your point about the load is a good one, though: another thread might read the value before I get to increment it. But the method has to deal with that problem anyway, because it needs time to handle the operand or operands it’s given to replace the value at that address. This means there must be a read-only lock secured by atomicExch() that blocks any other atomicExch() call while it is processing its parameter list, but still allows read access by non-atomicExch() calls to the value at that memory address. That would greatly increase throughput and makes perfect sense to me; in fact, it might be the very first thing done in the code of that method. This is typical for atomic database methods. Then, when all the operations are complete and the value at that address is updated, the new value is “committed” and the read-only lock is released. This could be why this approach works.

I should also mention that I’m not doing this intra-warp or intra-block: I update the global array only once, from the last warp of the block, with temporary values kept in shared memory being written to global memory after the block’s work completes. This means that no two threads in the same block will call atomicExch(). That could have something to do with the fact that it seems to be working.

  • R

Would be nice to find something a little easier to implement.

Thanks for the info on the prior thread, I’m looking at those older threads now.

  • R

Use fixed point, then you can just use normal integer atomics.

Downside is your precision isn’t adaptive, so you’ll have to choose your mapping range carefully.

You could even use two integers to get quite high precision. If the low word wraps (you can detect this from the atomicAdd result) then you can add the carry to the high word along with your high-word addend.
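
As a sketch of that two-word idea (untested; the name and the carry detection are my own):

__device__ void atomicAddFixed64( unsigned int* lo, unsigned int* hi,
                                  unsigned int addLo, unsigned int addHi )
{
	// atomicAdd returns the old value of *lo, so wrap-around of the low
	// word can be detected from the old value and the addend.
	unsigned int oldLo = atomicAdd( lo, addLo );
	unsigned int carry = (oldLo + addLo < oldLo) ? 1u : 0u;   // unsigned overflow test
	atomicAdd( hi, addHi + carry );
}

Note the high word is only consistent once all pending carries have landed, so it’s really meant to be read after the kernel finishes.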

Yeah, you guys were right. This strategy seems to work in my application (I get the same values every time), but I built a test kernel that really bangs hard on atomicExch and it can’t handle it. I get maybe 80% success when I create 32k threads and have every block increment every element of a 32x32x32 float array. The next thing I tried was writing a lock function, using a global unsigned int as a wrapper around atomicExch(), but that strategy can’t take the heat either: CUDA gives me a cudaErrorUnknown, probably because all of the blocks are trying to access the single global unsigned int lock at the same time and memory gets corrupted. The kernel actually runs with no error, but when I try to memcpy the array back I get cudaErrorUnknown.

NVidia really needs to fix this. I mean, after all, we are using the GPU because of its ability to deal well with floats, am I right?!

What’s the recommended method of employing fixed point as a workaround?

  • R

Post your failing test code.

I’ve used atomics extensively. I’ve often had problems too, but it’s always been because I screwed up. Once you get fancy (especially in try-until-success test loops), the chance of design error is huge.

If you have correct code that fails, post it so NVidia can figure it out.

As for fixed point, you likely should choose your range and just make two mapping functions something like:

#define LOW_FIXED  0.0f
#define HIGH_FIXED 10.0f

__device__ float fromFixed(unsigned int v)
{
	return LOW_FIXED + v*(HIGH_FIXED-LOW_FIXED)/0xFFFFFFFF;
}

__device__ unsigned int toFixed(float v)
{
	v = (v-LOW_FIXED) / (HIGH_FIXED-LOW_FIXED);
	return (unsigned int)(v*0xFFFFFFFF);
}

Those are untested off the top of my head but that’s the basic idea.

Be careful of the horrible scary problems if you end up exceeding the range of your fixed point… that’s the limitation you accept for using it.
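
For what it’s worth, as long as LOW_FIXED is 0.0f (so the mapping is a pure scale and increments add linearly), the accumulation itself is just an integer atomicAdd. A rough sketch (the kernel name is made up):

__global__ void k_AccumulateFixed( unsigned int* fixedArray, int ndx, float inc )
{
	// integer atomicAdd works on global memory on atomic-capable hardware
	atomicAdd( &fixedArray[ndx], toFixed(inc) );
}

Then convert the totals back with fromFixed() once the kernel is done.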

[quote name=‘SPWorley’ post=‘561307’ date=‘Jul 3 2009, 02:09 PM’]
Post your failing test code.
[/quote]

[codebox]typedef struct {
	int mX;  // columns
	int mY;  // rows
	int mZ;  // slices
	float* elements;
} Matrix3D;

// device code
__global__ void
k_TestAtomicExch( Matrix3D m, int aDim )
{
	// 3D thread blocks:  8x8x8
	volatile int bDimX = blockDim.x;
	volatile int bDimY = blockDim.y;
	volatile int bDimZ = blockDim.z;
	volatile int bDimXY = blockDim.x*blockDim.y;
	volatile float bThreads = (float)(blockDim.x*blockDim.y*blockDim.z);

	int c;
	int r;
	int s;
	int gNdx;
	float* tmp;

	__syncthreads();

	// only the last thread of each block updates the global array
	if (threadIdx.x == bDimX-1 && threadIdx.y == bDimY-1 && threadIdx.z == bDimZ-1) {
		for (c=0; c<aDim; c++) {
			for (r=0; r<aDim; r++) {
				for (s=0; s<aDim; s++) {
					gNdx = c + r*aDim + s*aDim*aDim;
					tmp = &m.elements[gNdx];
					atomicExch( &m.elements[gNdx], ( *tmp + bThreads ) );
				}
			}
		}
	}
}

// host side
extern "C" void
ciTestAtomicExch( Matrix3D &m3d, int aDim )
{
	m3d.mX = aDim;
	m3d.mY = aDim;
	m3d.mZ = aDim;

	const int bThreads = 512;

	// using 8x8x8 3D blocks
	int gxDim = m3d.mX/8;
	int gyDim = m3d.mY/8;
	int gzDim = m3d.mZ/8;
	int blocks = gxDim * gyDim * gzDim;
	int tThreads = bThreads * blocks;

	int eCnt = 0;
	int sCnt = 0;
	float sum = 0;
	float avgVal;
	float minVal = FLT_MAX;
	float value;
	bool error = false;

	dim3 dimBlock(8,8,8);
	dim3 dimGrid(gxDim,gyDim*gzDim);

	cutilSafeCall( cudaMalloc( (void**) &m3d.elements, tThreads * sizeof( float )) );

	float* h_fArray = (float*) calloc( tThreads, sizeof( float ));

	cutilSafeCall( cudaMemcpy( m3d.elements, h_fArray, tThreads * sizeof( float ), cudaMemcpyHostToDevice) );

	k_TestAtomicExch <<< dimGrid, dimBlock >>> ( m3d, aDim );

	cudaThreadSynchronize();

	cutilSafeCall( cudaMemcpy( h_fArray, m3d.elements, tThreads * sizeof( float ), cudaMemcpyDeviceToHost) );

	for (int c=0; c<aDim; ++c) {
		for (int r=0; r<aDim; ++r) {
			for (int s=0; s<aDim; ++s) {
				value = h_fArray[c + r*aDim + s*aDim*aDim];
				if (value < minVal) minVal = value;
				sum = sum + value;
				if ( (int)h_fArray[c + r*aDim + s*aDim*aDim] != tThreads ) {
					error = true;
					eCnt++;
//					printf("  Found error at index %d, %d, %d illegal value of %g\n",c,r,s,h_fArray[c + r*aDim + s*aDim*aDim]);
				}
				else sCnt++;
			}
		}
	}

	avgVal = sum/(float)(eCnt+sCnt);

	if (!error) printf("   Success!!  No errors found in %d blocks!",blocks);
	else {
		printf("    Errors found:  %d correct values, %d errors\n",sCnt,eCnt);
		printf("           Success Rate: %g %%\n",(float)sCnt/(float)(sCnt+eCnt)*100);
		printf("           Min value:  %g,   Avg:  %g\n",minVal,avgVal);
	}

	free(h_fArray);
	cutilSafeCall( cudaFree(m3d.elements) );
}[/codebox]

Output from a run typically will have numbers like this:

Errors found: 0 correct values, 32768 errors

       Success Rate: 0 

       Min value:  2048,   Avg:  22229.1

  • R

In what way does your code avoid race conditions? (Hint: it doesn’t at all; what you read in *tmp + bThreads and the return value of atomicExch are certainly not guaranteed to be equivalent.)

Well, my original idea was that the atomicExch code might have been written to lock the memory address it’s going to overwrite (the first parameter) before evaluating the value in the second parameter, but you are correct: my results indicate either that such is not the case or that the lock is only partially effective. I have been experimenting with a second approach that shows promise. I’m just showing the changes to the above code, plus two new device functions.

...

	if (threadIdx.x == bDimX-1 && threadIdx.y == bDimY-1 && threadIdx.z == bDimZ-1) {
		for (c=0; c<aDim; c++) {
			for (r=0; r<aDim; r++) {
				for (s=0; s<aDim; s++) {
					gNdx = c + r*aDim + s*aDim*aDim;
					tmp = &m.elements[gNdx];
					dAtomicAddFloat( &m.elements[gNdx], bThreads );
				}
			}
		}
	}

...

// forward declaration so dAtomicAddFloat can call it
__device__ float dIncFloat( volatile float* dest, float inc );

__device__ float
dAtomicAddFloat( volatile float* dest, float inc )
{
	volatile float* tmp1 = dest;
	volatile float* tmp2 = dest;
	return atomicExch( (float*)tmp1, dIncFloat( tmp2, inc ) );
}

__device__ float
dIncFloat( volatile float* dest, float inc )
{
	volatile float* tmp = dest;
	float value = *(float*)tmp + inc;
	atomicExch( (float*)tmp, value );
	return value;
}

If I run this code, the results are very different. The majority of elements are now correct:

Errors were found: 29121 correct values, 3647 errors

       Success Rate: 88.8702 

       Min value:  1536,   Avg:  32413.8

Isn’t that interesting? What seems to be happening is that calling the function three times results in much better locking. I’ve tried this with and without volatile, but I think it makes more sense to use volatile for the pointers. With a success rate of 88%, maybe with just a little more tweaking we can get there, because this code is really pushing atomicExch hard. Can you tell me anything about how these atomic functions limit access to memory while they are carrying out their operations? Given the partial success of this calling cascade, can you suggest a tweak that might do the job?

Thanks,

  • R

No. You’re still headed in the wrong direction. Your errors are not because you’re “pushing atomicExch hard”. This is not a GPU or compiler or CUDA or atomicExch error. It’s your code. As tmurray said, it has a race condition. Your changes still haven’t removed the race.

The main problem is that you are reading a value from device memory, doing some additions to it, then writing that back to device memory.

This is completely and totally unsafe! You can (and do) get corrupted, inaccurate values: a thread reads a value, but before it can write its change back to device memory, another thread reads the same variable, unaware of the first thread’s change. So values get lost.

There are two solutions.

First is to take several steps back from your code and redesign it to use parallel operations like reduction to safely and efficiently accumulate sums. In your application, it might make a lot more sense to iterate over each array value and have one single thread accumulate the value to be written. Accumulation of constant values doesn’t have any races. Scattering does since many threads all want to change the same value.
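
For instance, a gather-style version might look something like this (untested sketch; the routing test is just a stand-in for whatever your demux mapping really is):

__global__ void k_GatherAccumulate( float* out, const float* in, int nOut, int nIn )
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread owns one output element
	if (i >= nOut) return;

	float acc = 0.0f;
	for (int j = 0; j < nIn; ++j)
		if (j % nOut == i)        // stand-in for the real "does input j feed output i?" rule
			acc += in[j];

	out[i] = acc;                 // uncontended write: no atomics needed
}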

The other solution is atomics. But you’re using them incorrectly. If you use fixed point, it’s easier and cleaner. Alternatively, use a safe float atomic like this.

But really, atomics are usually not the answer for this kind of computation. Atomics should be an occasional kind of minor synchronizer, not your core math operation. They’re just not efficient. Look at redesigning your algorithm.

Yes, I agree. This approach is really a dead end. I’ve also been trying to institute an array of unsigned int locks for each element of the output array using atomicCAS but it’s not working either. I’ve done a fair amount of multi-threaded programming and I’m really uncomfortable not having any way to lock my global memory but I guess I’ll just have to get used to it.

Thanks to everyone for the help. I don’t think I’ll be wasting time with atomics in the future.

  • R

You can lock global memory using atomicCAS just fine; it’s just probably not a good idea (for performance reasons). Also, I would do things on a per-warp basis instead of a per-thread basis for various reasons (performance, burden on the compiler, etc.).
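
For example, something along these lines (a rough, untested sketch; one lock word per protected element, and it should be called from a single lane per warp so divergent lanes in the same warp don’t spin against each other):

__device__ void dLockedAddFloat( float* dest, unsigned int* lock, float inc )
{
	// spin until the lock word flips from 0 (free) to 1 (held)
	while (atomicCAS( lock, 0u, 1u ) != 0u)
		;
	*dest += inc;            // critical section
	__threadfence();         // make the store visible before releasing the lock
	atomicExch( lock, 0u );  // release
}

Keep the critical section tiny and expect it to be slow under heavy contention; that’s why the redesign suggestions above are usually the better answer.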