The following is the __clz() implementation for sm_1x that I put into device_functions.h (__clz() maps directly to a PTX instruction for sm_2x):

```
__device_func__(int __clz(int a)) {
return (a)?(158-(__float_as_int(__uint2float_rz((unsigned int)a))>>23)):32;
}
```

Comparing with the posted code, it appears the posted code computes log2(x) for non-zero x as 32 - __clz(x). However, 0 = log2(1) != 32 - __clz(1) = 1, 1 = log2(2) != 32 - __clz(2) = 2, etc. which may not be desired. I don’t have a computer in front of me to verify but it seems that for non-zero x, 32 - __clz(x-1) is equal to ceil(log2(x)), and 31 - __clz(x) is equal to floor(log2(x)).

Whatever formula one choses for integer log2(), it would be best to call __clz() directly, that way one gets the fastest implementation of __clz() on any CUDA-capable GPU.

[Later:]

I wrote a little test program to verify that, for non-zero x, the following holds:

ceil(log2(x) = 32-__clz(x-1)

floor(log2(x) = 31-__clz(x)