I’m currently porting an algorithm to CUDA. The algorithm uses the SSE2 instruction set, so it is already parallel to some extent.
The algorithm works on the short data type, and I ran into the following issue: with SSE2 (saturating instructions), subtracting 10 from -32768 (the lower limit of a short) leaves the value at -32768. In CUDA, however (as in plain C and most languages), subtracting 10 from -32768 wraps around to 32758. Is there an easy, but more importantly efficient, way to get the saturating behavior in CUDA, so that subtracting 10 from -32768 leaves it at -32768?
I was thinking along the lines of this myself:
[codebox]__device__ short subtractShort(short number1, short number2)
{
    int result = (int)number1 - (int)number2;  // do the math in 32 bits
    if (result < -32768) result = -32768;      // clamp at the short minimum
    return (short)result;
}[/codebox]
If you are at all memory-bandwidth bound (which is highly likely), an if that the compiler will clearly convert to predicated instructions will not slow performance at all.
I suspect that casting to 32-bit ints would be free. Try:
int result = ((int)number1) - number2;      // force the math to s32; this may be zero cost
return result >= -32768 ? result : -32768;  // implicit cast back to s16; max(result, -32768) is likely identical
The GPU hardware and PTX do support saturated subtraction as a hardware op, just like SSE. Unfortunately I don’t think it is exposed as a C intrinsic.
Look in the PTX reference guide. The opcode is sub.sat.s16.