I was looking up some information on bitwise tricks to do faster operations, and I saw that a quick way to do the absolute value operation on a 32-bit integer is:
//version 1
i = x < 0 ? -x : x;
And that’s the version that nVidia has used for the implementation in device code (see math_functions.h in the include directory of your CUDA toolkit). However, the same page I was reading also lists this version, which apparently is somewhat faster:
//version 2
i = (x ^ (x >> 31)) - (x >> 31);
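For anyone who wants to try both versions side by side, here is a minimal sketch in plain C. The function names `abs_cond` and `abs_bits` are just mine for illustration, and version 2 assumes that `>>` on a negative signed integer is an arithmetic (sign-extending) shift, which the C standard leaves implementation-defined even though most compilers behave that way.

```c
#include <stdint.h>

/* Version 1: conditional form. */
static int32_t abs_cond(int32_t x) {
    return x < 0 ? -x : x;
}

/* Version 2: branch-free form.
   Assumption: x >> 31 is an arithmetic shift, so the mask is
   0 for x >= 0 and -1 (all bits set) for x < 0. */
static int32_t abs_bits(int32_t x) {
    int32_t mask = x >> 31;
    /* For negative x: (x ^ -1) is ~x, and ~x - (-1) is ~x + 1, i.e. -x.
       For non-negative x: (x ^ 0) - 0 is just x. */
    return (x ^ mask) - mask;
}
```

Neither version has a meaningful result for the most negative 32-bit value, since its absolute value does not fit in an `int32_t`.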
If I get a little time after work today or tomorrow I’ll replace it and try to run some tests on it. The page I read claims that it’s 20% faster in their tests, but I wonder if it could be even better for CUDA since there’s no branching. Maybe one of the nVidia compiler guys can take a look as well?
Also, here’s the page I found, if you’re interested in some other ones (some of which are fairly well-known though):
I love these bit-twiddling tricks too, but I’d be surprised if the second one is any faster - the first version should compile to code that uses conditional execution, not branches.
This implementation is only used for the emulation mode (look at the #ifdefs around). The actual device implementation is an intrinsic:
/*DEVICE_BUILTIN*/
extern __host__ __device__ int abs(int) __THROW;
which is directly mapped to a PTX instruction, which itself is mapped to (at most!) one assembly instruction.
And as much as I like to nitpick on arithmetic function implementations, I have to acknowledge that the CUDA math library is really well designed. It is based on state-of-the-art algorithms from the literature and maintains a very good balance between conformance, accuracy and performance…
By the way, if you like bitwise tricks, look for the implementations of __popc and __clz in device_functions.h. Have your pen and paper ready. :)
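To give a flavor of that kind of trick without copying anything from the header, here is a sketch of the classic textbook approach to population count and count-leading-zeros in plain C. The names `popcount32` and `clz32` are hypothetical, and this is not claimed to match the code in device_functions.h.

```c
#include <stdint.h>

/* Count set bits by summing in parallel: first within 2-bit fields,
   then 4-bit fields, then bytes, then add the four bytes together. */
static uint32_t popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* per-byte sums */
    return (x * 0x01010101u) >> 24;                   /* total lands in the top byte */
}

/* Count leading zeros: copy the highest set bit into every lower
   position, then the number of set bits equals 32 minus the answer. */
static uint32_t clz32(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return 32u - popcount32(x);
}
```

With this formulation `clz32(0)` comes out as 32, which is a convenient convention for a 32-bit input.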
It would be nice if CUDA used the constant division algorithm found in “Hacker’s Delight” when compiling kernels. I have noticed massive speedups with it.
Ah, thanks Sylvain. I guess I didn’t notice the #ifdefs there. I thought that the compiler might be using them and just ‘injecting’ them into the device code when compiling. I did take a look at the __popc() function in there though…WOW. :blink:
Yes, this is a big deal in many cases, especially since the GPU is particularly slow at integer divides. I’m surprised the compiler doesn’t do the constant divide tricks already.
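For reference, the basic idea of the constant divide trick is to replace the division with a multiply by a precomputed "magic" reciprocal followed by a shift. Below is a minimal sketch in plain C for unsigned 32-bit division by 3 and by 10. The function names are hypothetical, and the constants are the commonly quoted ones for these two divisors; other divisors need their own constant and shift.

```c
#include <stdint.h>

/* x / 3 for unsigned 32-bit x.
   0xAAAAAAAB is ceil(2^33 / 3), so the 64-bit product shifted
   right by 33 gives the exact quotient. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* x / 10 for unsigned 32-bit x.
   0xCCCCCCCD is ceil(2^35 / 10), paired with a shift of 35. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

On hardware this needs only the high half of a 32x32 multiply plus a shift, which is why it can beat a full integer divide by a wide margin.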