fast log2 of unsigned int

yurijgera · July 10, 2011, 10:52am

This is basically a modified __clz(), tuned for maximum performance when estimating log2 of unsigned integers.

Maybe someone will find it useful.

static __device__ __forceinline__ unsigned int log2(unsigned int a)
{
	return (a) ? (__float_as_int(__uint2float_rz(a)) >> 23) - 126 : 0;
}

I also tested a branchless version, but it was slower:

return (((-a) >> 31) & 1) * ((__float_as_int(__uint2float_rz(a)) >> 23) - 126);

njuffa · July 10, 2011, 7:28pm

The following is the __clz() implementation for sm_1x that I put into device_functions.h (__clz() maps directly to a PTX instruction for sm_2x):

__device_func__(int __clz(int a)) {
  return (a)?(158-(__float_as_int(__uint2float_rz((unsigned int)a))>>23)):32;
}

Comparing with the posted code, it appears the posted code computes log2(x) for non-zero x as 32 - __clz(x). However, 0 = log2(1) != 32 - __clz(1) = 1, 1 = log2(2) != 32 - __clz(2) = 2, etc. which may not be desired. I don’t have a computer in front of me to verify but it seems that for non-zero x, 32 - __clz(x-1) is equal to ceil(log2(x)), and 31 - __clz(x) is equal to floor(log2(x)).

Whatever formula one chooses for integer log2(), it would be best to call __clz() directly; that way one gets the fastest implementation of __clz() on any CUDA-capable GPU.
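As a rough sketch of that advice, the two formulas can be wrapped as small device helpers that call __clz() directly. The names ilog2_floor and ilog2_ceil are made up for illustration, and the behavior for x == 0 is simply whatever falls out of __clz(0) == 32, not a recommended convention:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: floor(log2(x)) for x != 0.
// For x == 0, __clz(0) returns 32, so this yields -1; callers that need a
// different convention for zero must handle it themselves.
static __device__ __forceinline__ int ilog2_floor(unsigned int x)
{
    return 31 - __clz((int)x);
}

// Hypothetical helper: ceil(log2(x)) for x != 0.
// For x == 0, x - 1 wraps to 0xFFFFFFFF and __clz returns 0, so this
// yields 32; treat zero as out of range.
static __device__ __forceinline__ int ilog2_ceil(unsigned int x)
{
    return 32 - __clz((int)(x - 1u));
}
```

Returning int rather than unsigned int keeps the x == 0 case visible as -1 in the floor variant instead of wrapping to a large positive value.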

[Later:]

I wrote a little test program to verify that, for non-zero x, the following holds:

ceil(log2(x)) = 32 - __clz(x-1)

floor(log2(x)) = 31 - __clz(x)
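The original test program was not posted. A minimal sketch of one way to run such a check is below: it compares both identities against a simple shift-based reference for every non-zero 32-bit x. The kernel name, the reference functions, and the launch configuration are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reference floor(log2(x)) by repeated shifting; x must be non-zero.
__device__ int ref_floor_log2(unsigned int x)
{
    int r = 0;
    while (x >>= 1) ++r;
    return r;
}

// Reference ceil(log2(x)): floor plus one unless x is a power of two.
__device__ int ref_ceil_log2(unsigned int x)
{
    return ref_floor_log2(x) + (((x & (x - 1u)) != 0u) ? 1 : 0);
}

// Counts every x in 1 .. 2^32 - 1 for which either identity fails.
__global__ void check_identities(unsigned long long *mismatches)
{
    const unsigned long long stride =
        (unsigned long long)gridDim.x * blockDim.x;
    unsigned long long local = 0;
    // Grid-stride loop; a 64-bit index avoids wrap-around at 2^32.
    for (unsigned long long i =
             1ull + (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         i <= 0xFFFFFFFFull; i += stride) {
        const unsigned int x = (unsigned int)i;
        if (31 - __clz((int)x) != ref_floor_log2(x)) ++local;
        if (32 - __clz((int)(x - 1u)) != ref_ceil_log2(x)) ++local;
    }
    if (local) atomicAdd(mismatches, local);
}

int main()
{
    unsigned long long *d_mismatches = nullptr;
    unsigned long long h_mismatches = 0;
    if (cudaMalloc(&d_mismatches, sizeof(h_mismatches)) != cudaSuccess) return 1;
    cudaMemcpy(d_mismatches, &h_mismatches, sizeof(h_mismatches),
               cudaMemcpyHostToDevice);

    check_identities<<<1024, 256>>>(d_mismatches);

    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(&h_mismatches, d_mismatches,
                                 sizeof(h_mismatches), cudaMemcpyDeviceToHost);
    cudaFree(d_mismatches);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("mismatches: %llu\n", h_mismatches);
    return h_mismatches != 0;
}
```

If the identities hold as stated, the mismatch count should come out as zero. The 64-bit atomicAdd needs a GPU that supports it, which excludes the very earliest sm_1x parts.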
