Speeding up the Math.Abs() function for integers: a little bitwise trick

I was looking up some information on bitwise tricks to do faster operations, and I saw that a quick way to do the absolute value operation on a 32-bit integer is:

//version 1

i = x < 0 ? -x : x;

And that’s the version that nVidia has used for the implementation in device code (see math_functions.h in the include directory of your CUDA toolkit). However, the same page I was reading also lists this version, which apparently is somewhat faster:

//version 2

i = (x ^ (x >> 31)) - (x >> 31);

If I get a little time after work today or tomorrow I’ll replace it and try to run some tests on it. The page I read claims it’s 20% faster in their tests, but I wonder if it could be even better for CUDA, since there’s no branching. Maybe one of the nVidia compiler guys can take a look as well?

Also, here’s the page I found, if you’re interested in some of the other tricks (though some of them are fairly well known):


I love these bit-twiddling tricks too, but I’d be surprised if the second one is any faster: the first version should compile to code that uses conditional execution (predication), not branches.

This implementation is only used for the emulation mode (look at the #ifdefs around it). The actual device implementation is an intrinsic:


extern __host__ __device__ int abs(int) __THROW;

which is directly mapped to a PTX instruction, which itself is mapped to (at most!) one assembly instruction.

And as much as I like to nitpick on arithmetic function implementations, I have to acknowledge that the CUDA math library is really well designed. It is based on state-of-the-art algorithms from the literature and maintains a very good balance between conformance, accuracy and performance…

By the way, if you like bitwise tricks, look for the implementations of __popc and __clz in device_functions.h. Have your pen and paper ready. :)

It would be nice if CUDA used the constant division algorithm found in “Hacker’s Delight” when compiling kernels. I have noticed massive speedups with it.

Ah, thanks Sylvain. I guess I didn’t notice the #ifdefs there. I thought that the compiler might be using them and just ‘injecting’ them into the device code when compiling. I did take a look at the __popc() function in there though…WOW. :blink:

Yes, this is a big deal in many cases, especially since the GPU is particularly slow at integer divides. I’m surprised the compiler doesn’t do the constant divide tricks already.

It does (for 32-bit arithmetic).