fast log2 of unsigned int

This is basicly modified __clz() to get maximum performance for log2 estimation on unsigned integers.

May be someone needs it.

static __device__ __forceinline__ unsigned int log2(unsigned int a)

{

	return (a) ? (__float_as_int(__uint2float_rz(a)) >> 23) - 126 : 0;

}

I have also tested branchles version, but it was slower.

return (((-a) >> 31) & 1) * ((__float_as_int(__uint2float_rz(a)) >> 23) - 126);

The following is the __clz() implementation for sm_1x that I put into device_functions.h (__clz() maps directly to a PTX instruction for sm_2x):

__device_func__(int __clz(int a)) {    

  return (a)?(158-(__float_as_int(__uint2float_rz((unsigned int)a))>>23)):32;  

}

Comparing with the posted code, it appears the posted code computes log2(x) for non-zero x as 32 - __clz(x). However, 0 = log2(1) != 32 - __clz(1) = 1, 1 = log2(2) != 32 - __clz(2) = 2, etc. which may not be desired. I don’t have a computer in front of me to verify but it seems that for non-zero x, 32 - __clz(x-1) is equal to ceil(log2(x)), and 31 - __clz(x) is equal to floor(log2(x)).

Whatever formula one choses for integer log2(), it would be best to call __clz() directly, that way one gets the fastest implementation of __clz() on any CUDA-capable GPU.

[Later:]

I wrote a little test program to verify that, for non-zero x, the following holds:

ceil(log2(x) = 32-__clz(x-1)

floor(log2(x) = 31-__clz(x)