Subtracting two numbers without changing sign?


I’m currently porting an algorithm to CUDA. This algorithm uses the SSE2 instruction set, so it sort of already is parallel.

The algorithm uses the short data type. However, I ran into an issue: with SSE2, subtracting 10 from -32768 (the lower limit of a short) leaves the value at -32768, because the subtraction saturates. In CUDA (and probably most programming languages), subtracting 10 from -32768 wraps around to 32758 (which makes sense). Is there an easy and, more importantly, efficient way to get the SSE2 behavior in CUDA, so that subtracting 10 from -32768 stays at -32768?
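To make the wraparound concrete, here is a minimal sketch of a plain 16-bit subtraction. The name `wrap_sub_s16` and the `HOST_DEVICE` macro are made up for illustration; the macro is only there so the same function compiles under nvcc and under an ordinary C++ compiler.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: 16-bit subtraction that wraps on overflow.
HOST_DEVICE inline short wrap_sub_s16(short a, short b)
{
    // Subtract in unsigned arithmetic so that wraparound is well defined.
    unsigned int diff = static_cast<unsigned int>(a) - static_cast<unsigned int>(b);
    // Keep only the low 16 bits. Converting a value above 32767 back to short
    // is modular on the usual compilers (and guaranteed from C++20 on).
    return static_cast<short>(static_cast<unsigned short>(diff));
}
```

With this, `wrap_sub_s16(-32768, 10)` gives 32758, which is the result described above.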

I was thinking along the lines of this myself:

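__device__ short subtractShort(short number1, short number2)
{
    // If the wrapped result is larger than number1, the subtraction underflowed
    // (this check assumes number2 is positive), so clamp to the lower limit.
    return ((short)(number1 - number2) > number1) ? (short)-32768 : (short)(number1 - number2);
}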


This works, but how efficient is it when every thread in a warp calls this function with different parameters?



Does nobody have an idea?
I have the feeling that the device function in my first post really hurts performance because of the conditional.

Have you benchmarked it?

If you are at all memory bandwidth bound (which is highly likely), a conditional like this, which will clearly be converted to predicated instructions, will not slow performance at all.


Yeah, I have benchmarked it. I changed the code so that it no longer uses the full range of the short. Instead of calling the device function, I now do:

result = number1 - number2;

result = max(0, result);

This made the kernel three times faster! Unfortunately, it limits result to the range 0 to 32767 (short).

It would be nice if NVIDIA could implement signed saturation like the one available in SSE/MMX.

I suspect that casting to 32-bit ints would be free. Try:

int result = ((int)number1) - number2; // force math to s32, this may be 0 cost

return result >= -32768 ? result : -32768; // implicit cast back to s16. max(result, -32768) is likely identical
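The snippet above only clamps at the lower limit. If number2 can be negative, the result can also exceed 32767, so here is a rough sketch that clamps on both sides. The name `sat_sub_s16` and the `HOST_DEVICE` macro are hypothetical, and nobody in this thread has benchmarked this version.

```cpp
#include <climits>  // SHRT_MIN, SHRT_MAX

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: 16-bit subtraction that saturates at both limits of short.
HOST_DEVICE inline short sat_sub_s16(short a, short b)
{
    // Do the math in 32 bits, where the difference of two shorts cannot overflow.
    int diff = static_cast<int>(a) - static_cast<int>(b);
    // Clamp to the representable range of a short.
    if (diff < SHRT_MIN) diff = SHRT_MIN;
    if (diff > SHRT_MAX) diff = SHRT_MAX;
    return static_cast<short>(diff);
}
```

For example, `sat_sub_s16(-32768, 10)` stays at -32768 and `sat_sub_s16(32767, -10)` stays at 32767, while in-range results such as `sat_sub_s16(100, 30)` are unchanged.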

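The GPU hardware and PTX do support saturated subtraction, similar to SSE. Unfortunately, I don't think there's a C-level intrinsic for it.

Look in the PTX reference guide under the sub instruction and its .sat modifier. Note that the guide lists .sat as applying only to the .s32 type, so the opcode is sub.sat.s32 and it saturates at the 32-bit limits; for 16-bit saturation you still have to clamp to the short range yourself.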
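Later CUDA toolkits also document SIMD-in-a-word intrinsics, among them `__vsubss2`, which subtracts two pairs of packed 16-bit values with signed saturation. That is close to what the SSE2 code does. The sketch below assumes your toolkit provides that intrinsic and that the two shorts are packed as the low and high halves of an unsigned int. The wrapper name `sat_sub_s16x2`, the helper `halfword_to_int`, and the host fallback are hypothetical and only there so the code can be tried without a GPU.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: sign-extend a 16-bit pattern (0..65535) to int.
HOST_DEVICE inline int halfword_to_int(unsigned int h)
{
    return static_cast<int>(h ^ 0x8000u) - 0x8000;
}

// Hypothetical wrapper: saturating subtract on two packed signed 16-bit lanes.
HOST_DEVICE inline unsigned int sat_sub_s16x2(unsigned int a, unsigned int b)
{
#ifdef __CUDA_ARCH__
    // Device path: per-halfword subtraction with signed saturation.
    return __vsubss2(a, b);
#else
    // Host fallback: emulate the same operation one 16-bit lane at a time.
    unsigned int out = 0;
    for (int lane = 0; lane < 2; ++lane) {
        int shift = 16 * lane;
        int x = halfword_to_int((a >> shift) & 0xFFFFu);
        int y = halfword_to_int((b >> shift) & 0xFFFFu);
        int d = x - y;
        if (d < -32768) d = -32768;
        if (d > 32767) d = 32767;
        out |= (static_cast<unsigned int>(d) & 0xFFFFu) << shift;
    }
    return out;
#endif
}
```

Processing two shorts per 32-bit word also halves the number of subtract instructions, but whether that actually helps depends on how the data is laid out in memory, so benchmark it before relying on it.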