Maximum absolute value: fastest method for floating point numbers

I’m optimizing some code, and currently one of my bottlenecks is finding the maximum absolute element of a vector. In this case, the vector length is 24, and all variables are within registers. Currently, I just do the following (where my vector is stored as 6 float4 variables):

[codebox]float c0 = fmaxf(fabsf((a0).x), fabsf((a0).y));
float c1 = fmaxf(fabsf((a0).z), fabsf((a0).w));
float c2 = fmaxf(fabsf((a1).x), fabsf((a1).y));
float c3 = fmaxf(fabsf((a1).z), fabsf((a1).w));
float c4 = fmaxf(fabsf((a2).x), fabsf((a2).y));
float c5 = fmaxf(fabsf((a2).z), fabsf((a2).w));
float c6 = fmaxf(fabsf((a3).x), fabsf((a3).y));
float c7 = fmaxf(fabsf((a3).z), fabsf((a3).w));
float c8 = fmaxf(fabsf((a4).x), fabsf((a4).y));
float c9 = fmaxf(fabsf((a4).z), fabsf((a4).w));
float c10 = fmaxf(fabsf((a5).x), fabsf((a5).y));
float c11 = fmaxf(fabsf((a5).z), fabsf((a5).w));

c0 = fmaxf(c0, c1); c1 = fmaxf(c2, c3); c2 = fmaxf(c4, c5);
c3 = fmaxf(c6, c7); c4 = fmaxf(c8, c9); c5 = fmaxf(c10, c11);
c0 = fmaxf(c0, c1); c1 = fmaxf(c2, c3); c2 = fmaxf(c4, c5);
c0 = fmaxf(c0, c1); c0 = fmaxf(c0, c2);[/codebox]

I just wondered if anyone had any smarter way of doing this? I see in the programming guide that integer max and min functions are one cycle operations, but I don’t think this is true for fmaxf. I don’t believe there is a faster solution, but if anyone has one, I’d love to hear it.

Cheers.

I think you could use the reduction technique here (see the reduction example - it is used to calculate a sum, but you can also adapt it to calculate the maximum, …)
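For concreteness, a minimal sketch of that adaptation, assuming a block-level shared-memory reduction in the style of the SDK sample (the kernel name and launch assumptions are mine, not from the sample):

[codebox]// One block reduces n floats to their maximum absolute value; the only change
// from the sum version is fmaxf in place of the addition. Assumes blockDim.x is
// a power of two and n <= blockDim.x.
__global__ void abs_max_reduce(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Fold the fabsf into the load; pad missing elements with 0.
    sdata[tid] = (tid < n) ? fabsf(in[tid]) : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}[/codebox]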

Is each thread working on a different vector?

Yes, exactly. Each thread is dealing with vectors of length 24. (FYI, this is a vector field problem, where I have a grid, but each grid point is itself a vector.)

I should also mention that I’ve tried playing around with the integer max function. According to the CUDA programming guide, this is a one-cycle operation. Something like:

[codebox]#define FAST_ABS_MAX(A, B) __int_as_float(max(__float_as_int(fabsf(A)), __float_as_int(fabsf(B))))[/codebox]

works because the integer representation of (non-negative) floating point numbers is ordered the same way as the values themselves. However, performance is slower this way.
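For what it’s worth, a hedged illustration of that ordering property (the helper name is mine; the bit patterns are just the standard IEEE-754 encodings):

[codebox]// Non-negative IEEE-754 floats keep their ordering when reinterpreted as ints:
//   __float_as_int(1.0f) == 0x3F800000
//   __float_as_int(2.0f) == 0x40000000   (and 0x3F800000 < 0x40000000)
// Since fabsf clears the sign bit, a plain integer max then picks the larger magnitude.
__device__ float abs_max_pair(float a, float b)
{
    int ia = __float_as_int(fabsf(a));
    int ib = __float_as_int(fabsf(b));
    return __int_as_float(max(ia, ib));
}[/codebox]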

Oh well, I guess I should just call it a day here, and move on…

What happens if you do

[codebox]#define FAST_ABS_MAX(A, B) __int_as_float(max(abs(__float_as_int(A)), abs(__float_as_int(B))))[/codebox]

That should get rid of all the FP operations.

I’m not sure why, but this suggestion evaluates incorrectly. Why these two operations don’t commute is beyond me at this time on a Friday.

The fastest combination that I’ve come up with is given below, though performance varies a lot depending on the surrounding code (I guess the compiler plays tricks).

[codebox]#define FAST_ABS_MAX(a, b) \
    max(__float_as_int(fabsf(a)), __float_as_int(fabsf(b)))

__device__ uint fast_abs_max(float4 a) {
    uint c0 = FAST_ABS_MAX(a.x, a.y);
    uint c1 = FAST_ABS_MAX(a.z, a.w);
    return max(c0, c1);
}

uint c0 = fast_abs_max(a0);
uint c1 = fast_abs_max(a1);
c0 = max(c0, c1);

uint c2 = fast_abs_max(a2);
uint c3 = fast_abs_max(a3);
c1 = max(c2, c3);
c0 = max(c0, c1);

c2 = fast_abs_max(a4);
c3 = fast_abs_max(a5);
c1 = max(c2, c3);
c0 = max(c0, c1);[/codebox]
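One follow-up note (my addition, not part of the original snippet): the final c0 holds the raw bit pattern of a non-negative float, so to use the result as a float again it has to be reinterpreted rather than converted:

[codebox]// Reinterpret the bits; a plain (float)c0 cast would convert the integer value instead.
float max_abs = __uint_as_float(c0);[/codebox]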

It is because negative floating-point values are represented differently than negative integers. In two’s-complement integer form a negative number is its magnitude subtracted from 2^32 (e.g. 0xffffffff stands for -1), whereas in floating-point form a negative number differs from its positive counterpart only in the sign bit. All fabsf does is clear that sign bit! One could define it as

[codebox]__device__ float fabsf(float var) {
    return __int_as_float(__float_as_int(var) & 0x7fffffff);
}[/codebox]
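A concrete illustration of that difference (standard bit patterns, added here for clarity), showing why the integer-abs suggestion above goes wrong:

[codebox]// For v = -1.0f:
//   __float_as_int(v)              == 0xBF800000  (sign bit | bits of 1.0f)
//   abs(__float_as_int(v))         == 0x40800000  (integer abs = two's-complement negation)
//   __int_as_float(0x40800000)     == 4.0f        -- nothing like fabsf(v)
//   __float_as_int(v) & 0x7fffffff == 0x3F800000  == 1.0f, the intended result
__device__ float wrong_abs(float v)   // what the integer-abs version actually computes
{
    return __int_as_float(abs(__float_as_int(v)));
}[/codebox]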

Use decuda to look at the code that nvcc generates.

The most straightforward version should be the fastest, and this line should map to a single instruction:

float c0 = fmaxf(fabsf((a0).x), fabsf((a0).y));

(Performing fabsf and FP negations before an FP operation requires approximately zero additional hardware…)

Don’t forget GPUs were once floating-point only processors… Floating-point arithmetic is still typically as fast or faster than integer arithmetic.

Now that you mention it, ISTR that the article in which I found the casting trick noted that it depended on the integer representation being sign-magnitude, and not two’s complement…

but negative integers are usually two’s complement. With that representation, performing a subtraction is exactly the same as performing an addition with a negative number, without any special treatment of the sign bit.

To overcome this problem you can XOR the float’s bit pattern with 0x7fffffff when it is negative, and then XOR again whenever a result is negative to map it back.
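A hedged sketch of that XOR trick (helper names are mine): it maps a float’s bits to a signed int whose ordering matches the float ordering, and the same transform undoes itself on the way back.

[codebox]// Map float bits to a signed int that orders the same way as the float itself.
// Negative floats get their lower 31 bits flipped (XOR with 0x7fffffff), which
// reverses their otherwise back-to-front ordering; non-negative floats pass through.
__device__ int float_as_ordered_int(float f)
{
    int i = __float_as_int(f);
    return (i < 0) ? (i ^ 0x7fffffff) : i;
}

// The XOR never touches the sign bit, so negatives stay negative and the
// mapping is its own inverse: XOR again when the value is negative.
__device__ float ordered_int_as_float(int i)
{
    return __int_as_float((i < 0) ? (i ^ 0x7fffffff) : i);
}[/codebox]

With this, max(float_as_ordered_int(a), float_as_ordered_int(b)) also works for signed values, though for the problem in this thread the fabsf already makes everything non-negative.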