I’m currently porting an algorithm to CUDA. The algorithm uses the SSE2 instruction set, so it is already parallel to some degree.
The algorithm uses the short data type. However, I ran into an issue: with SSE2, subtracting 10 from -32768 (the lower limit of a short) leaves the value at -32768, i.e. the subtraction saturates. In CUDA, however (and probably in any programming language), subtracting 10 from -32768 gives 32758, because the result wraps around, which makes sense. Is there an easy and, more importantly, efficient way to get the saturating behavior in CUDA, so that subtracting 10 from -32768 stays at -32768?
I was thinking along these lines myself:
[codebox]__device__ short subtractShort(short number1, short number2)
{
    return ( (short)(number1 - number2) > number1 ) ? (short)0x8000 : (short)(number1 - number2);
}[/codebox]
This works, but is it efficient when every thread in a warp calls this function with different parameters?
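For comparison, here is a minimal sketch of an alternative that widens to int and clamps to the short range, so it also covers the overflow case (for example 32767 minus -10), which the comparison above does not handle. The name `subtractShortSat` is hypothetical, and the macro guard is only there so the same source can be compiled as plain C++ for a quick host-side check. The two clamps are simple enough that a compiler would likely emit min/max or select instructions rather than divergent branches, but that is an assumption that would need checking in the generated code.

```cpp
// Let the same source compile as plain C++ for host-side testing (assumption:
// without nvcc, the CUDA qualifiers are simply defined away).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Saturating 16-bit subtraction: widen to int, subtract, clamp to the short range.
// subtractShortSat is a hypothetical name, not an existing CUDA function.
__host__ __device__ inline short subtractShortSat(short number1, short number2)
{
    int diff = (int)number1 - (int)number2;  // cannot overflow a 32-bit int
    if (diff < -32768) diff = -32768;        // clamp underflow to the lower limit
    if (diff >  32767) diff =  32767;        // clamp overflow to the upper limit
    return (short)diff;
}
```

With this version, `subtractShortSat(-32768, 10)` stays at -32768 and `subtractShortSat(32767, -10)` stays at 32767. CUDA's SIMD intrinsics also include `__vsubss2`, which performs a signed saturating subtraction on two 16-bit values packed into an unsigned int; whether either approach is faster in practice would need to be measured.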