AtomicMax with floats

Hi

I have seen that there are no atomic operations for floats. Do you know any easy way of obtaining the maximum of a set of floats? Is there any trick that allows using AtomicMax?

Thanks

Sergio

Yes, look here for a discussion of the trick:

http://forums.nvidia.com/index.php?showtopic=91491

I should note that if you are trying to find the maximum float in a very big list, atomicMax() is not a good approach. It would be better to do a parallel reduction. Usually reduction examples demonstrate the addition operation, but any associative binary operator (like max) will work.

The number of elements in the list would not be large.

Quick question: To convert the float to an integer and then use atomicmax, would “int __float_as_int(float)” work?

No, see the post link http://www.stereopsis.com/radix.html; you have to flip the sign and invert the exponent for negative numbers. It’s a nice trick.

For the point of argument, you could also implement any associative operator with atomicCAS

old = mem[0] # integer

while (true) {

	current = atomic_cas(mem, old, floatasint(<operator>(intasfloat(old), v) ))

	if current == old, return

	old = current

}

that would definitely be slow but wait-free.

I understand the trick, and used this method:

static inline unsigned int FloatFlip(unsigned int f)

{

	unsigned int mask = -int(f >> 31) | 0x80000000;

	return f ^ mask;

}

For some reason all these different floats return the same unsigned integer, so then the trick becomes useless…

float, 3.662095, floatflip,2147483651,

float, 3.834287, floatflip,2147483651,

float, 3.912001, floatflip,2147483651,

Do you know what is wrong with it?

Thanks

How are you converting the float to an unsigned int to pass in as parameter f? If you scan down to the large block of code on the page with the trick, you’ll see that they use a pointer cast to reinterpret the float as an int before calling FloatFlip. If you just pass a float into FloatFlip by value, the float will be cast to an int by truncation, which is not what you want.

Basically, you need to use either __float_as_int() to convert the float to an int for this function, or you need a pointer cast to force the reinterpretation of the float as an int somewhere before calling FloatFlip().

Here’s a similar but simpler method I’ve found works using signed ints instead of unsigned ones: just convert them to ints using __float_as_int(), then toggle all but the most significant bit if the number is -ve.

This maps floats to signed ints in such a way that ordering is preserved, as well preserving whether the number is +ve or -ve and mapping 0 to 0.

Then you can use atomicMin or atomicMax operations, and covert back doing the reverse (for -ve numbers toggle all but the most significant bit, then use __int_as_float() ).

If your floats are always non-negative, you can make things even simpler by just using __float_as_int() and __int_as_float() as they come out of the box.

It might seem a bit too much like magic, but it works because of the bit structure that floats use (single bit for sign, followed by unsigned representation for exponent, and unsigned representation that isn’t denormalized for the mantissa).

Here’s some code I’ve been using for the conversions:

device int floatToOrderedInt( float floatVal )

{

int intVal = __float_as_int( floatVal );

return (intVal >= 0 ) ? intVal : intVal ^ 0x7FFFFFFF;

}

device float orderedIntToFloat( int intVal )

{

return __int_as_float( (intVal >= 0) ? intVal : intVal ^ 0x7FFFFFFF );

}

I’m working on a SVM classifier in which each block is responsible for a testing point. Each thread in the block calculates a floating point value and the data point is classified using the sum of these values.

I came up with 4 different ways to combine the data between the threads:

  1. Each thread does an atomicAdd on a shared variable

  2. Each thread writes to it’s own index in a shared array and the first thread in each block computes the sum

  3. Each thread writes to global memory and the CPU computes the sum

  4. Use a reduction – But I don’t think this will work very well because I only have 284 values and I would have to do the reduction for each block

I was going to implement 1-3 just to see which was faster in my case.

I’m having some trouble with the method 1 (atomicAdd) however.

I tried using both sets of functions mentioned in this thread for transitioning between float and int/unsigned int but my values seem to ‘explode’ regardless.

I have tested in emulation mode and on a GTX 285. Compilation is being done with the -arch compute_12 -code sm_12 flags because the Programming Guide states that Atomic functions operating on shared memory are only available for devices of compute capability 1.2 and above.

Using the int method:

[codebox]

device int floatToOrderedInt( float floatVal )

{

int intVal = __float_as_int( floatVal );

return (intVal >= 0 ) ? intVal : intVal ^ 0x7FFFFFFF;

}

device float orderedIntToFloat( int intVal )

{

return __int_as_float( (intVal >= 0) ? intVal : intVal ^ 0x7FFFFFFF );

}

global void SVMClassify(…)

{

// Shared memory for storing the sum across a block

__shared__ int s_Sum;

const unsigned int sharedIndex = threadIdx.x;

   ...

   ...

// Thread 0 initializes sum to 0

if( sharedIndex == 0 )

{

	s_Sum = floatToOrderedInt(0.0f);

	printf("t[%d]: Set sum = %f\n", sharedIndex, s_Sum);

}

__syncthreads();

float kernel = someFunction(...);

printf("  sv[%d]: Sum = %f, Adding %f\n", mySupportVector, orderedIntToFloat(s_Sum), kernel);

atomicAdd(&s_Sum, floatToOrderedInt(kernel));

printf("    Sum After = %f\n", orderedIntToFloat(s_Sum));

__syncthreads();

// Thread 0 adds the bias and does the classification

if( sharedIndex == 0 )

{

	printf("testPt[%d]: Sum=%f, Sum+bias = %f", myTestPt, orderedIntToFloat(s_Sum), orderedIntToFloat(s_Sum)+bias);

	if( orderedIntToFloat(s_Sum) + bias > 0 )

	{

		...

	}

	else

	{

		...

	}

}

}

[/codebox]

Running with emulation produces the following output:

[codebox]

t[0]: Set sum = 0.000000

sv[0]: Sum = 0.000000, Adding -3.334319

Sum After = -3.334319

sv[1]: Sum = -3.334319, Adding -8.011626

Sum After = 113261883569915010000000000000000000000.000000

sv[2]: Sum = 113261883569915010000000000000000000000.000000, Adding -4.576065

Sum After = 0.148421

sv[3]: Sum = 0.148421, Adding -0.649955

Sum After = -0.000000

sv[4]: Sum = -0.000000, Adding -3.825862

Sum After = -16.407537

sv[5]: Sum = -16.407537, Adding -9.439597

Sum After = 19083237488847770000000000000000000000.000000

sv[6]: Sum = 19083237488847770000000000000000000000.000000, Adding -0.939891

Sum After = 0.119675

sv[7]: Sum = 0.119675, Adding -35.078735

Sum After = -0.000000

sv[8]: Sum = -0.000000, Adding -0.641994

Sum After = -187.571609

testPt[0]: Sum=62074921843877610000000000000.000000, Sum+bias = 62074921843877610000000000000.000000

t[0]: Set sum = 503592908302380660000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000

0

000000000000000000000000000000000000000000000000000000000000

0

000000000000000000000000000000000000000000

000000000000000000000000000000000000000000000000000000000000

00000000.000000

[/codebox]

The unsigned int method:

[codebox]

// Provided by http://www.stereopsis.com/radix.html

// Convert a float into an int so we can use atomic operations

typedef long int32;

typedef unsigned long uint32;

inline device uint32 FloatFlip(uint32 f)

{

uint32 mask = -int32(f >> 31) | 0x80000000;

return f ^ mask;

}

inline device uint32 IFloatFlip(uint32 f)

{

uint32 mask = ((f >> 31) - 1) | 0x80000000;

return f ^ mask;

}

global void SVMClassify(…)

{

// Shared memory for storing the sum across a block

__shared__ unsigned int s_Sum;

const unsigned int sharedIndex = threadIdx.x;

   ...

   ...

// Thread 0 initializes sum to 0

if( sharedIndex == 0 )

{

	s_Sum = FloatFlip(__float_as_int(0.0f));

	printf("t[%d]: Set sum = %f\n", sharedIndex, s_Sum);

}

__syncthreads();

float kernel = someFunction(...);

printf("  sv[%d]: Sum = %f, Adding %f\n", mySupportVector, __int_as_float(IFloatFlip(s_Sum)), kernel);

atomicAdd(&s_Sum, FloatFlip(__float_as_int(kernel)));

printf("    Sum After = %f\n", __int_as_float(IFloatFlip(s_Sum)));

__syncthreads();

// Thread 0 adds the bias and does the classification

if( sharedIndex == 0 )

{

	printf("testPt[%d]: Sum=%f, Sum+bias = %f", myTestPt, __int_as_float(IFloatFlip(s_Sum)), __int_as_float(IFloatFlip(s_Sum))+bias);

	if( __int_as_float(IFloatFlip(s_Sum)) + bias > 0 )

	{

		...

	}

	else

	{

		...

	}

}

[/codebox]

Produces the following output:

[codebox]

t[0]: Set sum = 0.000000

sv[0]: Sum = 0.000000, Adding -3.334319

Sum After = 1.332840

sv[1]: Sum = 1.332840, Adding -8.011626

Sum After = 113261883569915010000000000000000000000.000000

sv[2]: Sum = 113261883569915010000000000000000000000.000000, Adding -4.576065

Sum After = -29.002066

sv[3]: Sum = -29.002066, Adding -0.649955

Sum After = -0.000000

sv[4]: Sum = -0.000000, Adding -3.825862

Sum After = 0.246816

sv[5]: Sum = 0.246816, Adding -9.439597

Sum After = 19083237488847770000000000000000000000.000000

sv[6]: Sum = 19083237488847770000000000000000000000.000000, Adding -0.939891

Sum After = -34.726513

sv[7]: Sum = -34.726513, Adding -35.078735

Sum After = -0.000000

sv[8]: Sum = -0.000000, Adding -0.641994

Sum After = 0.023978

testPt[0]: Sum=62074921843877610000000000000.000000, Sum+bias = 62074921843877610000000000000.000000

t[0]: Set sum = 0.000000

[/codebox]

I thought this might have just been due to something funky with the CPU calculations in emulation mode but I get incorrect classifications even if I run in debug mode.

I also tried recompiling using -arch compute_13 -code sm_13 but that did not remedy the situation.

I’ve read through the code multiple times to make sure I am not doing something stupid and couldn’t find anything…hopefully I didn’t just miss something.

I’d appreciate any assistance you guys could provide. Thank you.

[sub]Edit: Changed shared variable from int to unsigned int in the unsigned int method[/sub]

The float to ordered int is a one-to-one mapping which maps the 32 bits of float to the 32 bits of int in a way that preserves ordering, but that’s all. Integer min and max will work, but attempts at addition or multiplication will explode. As you discovered.

It is possible to hack together an atomic add for floats using AtomicCAS.

  1. Get the original float value from memory
  2. Add your float value to it
  3. Attempt to atomically save new value
    3.a. Reinterpret original value as int
    3.b. Reinterpret new value as int
    3.c. Cast pointer to memory to be an int pointer
    3.d. AtomicCAS, which tests if memory contains original, and if so, swaps with the new value
    3.e. If memory did contain original, then we succeeded, otherwise we failed because a different thread updated memory in the meantime.
  4. If save failed, go back to step 1

I can already tell you that this is going to be incredibly slow. Much much better to use shared memory.

There is a strategy for reducing within a block using shared memory, which will be even faster than using thread 0 to add them all. It happens to be a piece of the reduction example, so look at the example code in the SDK.

Thanks for the reply Jamie! If I would have stopped to think about what was going on I probably would (should) have realized that…I guess I just got too excited about the magic trick for atomic operations and didn’t pay enough attention to everything else.

I’ll check out the reduction SDK example again for the within block reduction.

Thanks again!

Having done the normal reduction within the block would it be reasonable to do the following: -

if (cacheIndex == 0) {
while (globalMaxValue < cache[0]) globalMaxValue = cache[0];
}

without race conditions the while loop would execute only once at most.
If there is a race condition with a block storing a smaller value
it will continue looping until either it succeeds
or another block stores a larger value.

No it’s not reasonable.

Block B tests globalMaxValue and finds it to be less than 15
Block A tests globalMaxValue and finds it to be less than 12
Block B stores 15 in globalMaxValue
Block B tests globalMaxValue and finds it to be 15 and exits
Block A stores 12 in globalMaxValue
Block A tests globalMaxValue and finds it to be 12 and exits

The final result is 12 which is incorrect (not the max value). The atomicity of the test and store operation is essential to a proper result.