I’m currently porting an algorithm to CUDA. The algorithm uses the SSE2 instruction set, so it is already parallel to some extent.
The algorithm works on the short data type, and I ran into the following issue: with SSE2 (saturating instructions), subtracting 10 from -32768 (the lower limit of a short) leaves the value at -32768. In CUDA, however (as in plain C and most languages), subtracting 10 from -32768 wraps around to 32758. Is there an easy, but more importantly efficient, way to get the saturating behavior in CUDA, so that subtracting 10 from -32768 leaves it at -32768?
I was thinking along the lines of this myself:
[codebox]__device__ short subtractShort(short number1, short number2)
{
    int result = (int)number1 - (int)number2;  // do the math in 32 bits
    if (result < -32768) result = -32768;      // clamp at the short minimum
    return (short)result;
}[/codebox]
If you are at all memory-bandwidth bound (which is highly likely), an if that the compiler will clearly convert to predicated instructions will not slow performance at all.
I suspect that casting to 32-bit ints would be free. Try:
int result = ((int)number1) - number2;      // force the math to s32; this may be zero cost
return result >= -32768 ? result : -32768;  // implicit cast back to s16; max(result, -32768) is likely identical
The GPU hardware and PTX do support saturated subtraction as a hardware op, just like SSE. Unfortunately I don’t think it is exposed as a C intrinsic.
Look in the PTX reference guide. The opcode is sub.sat.s16.