It appears the posted code computes log2(x) for non-zero x as 32 - __clz(x). However, log2(1) = 0 while 32 - __clz(1) = 1, and log2(2) = 1 while 32 - __clz(2) = 2, and so on, which may not be what is desired: the expression is one larger than floor(log2(x)). I don’t have a computer in front of me to verify, but it seems that for non-zero x, 32 - __clz(x-1) is equal to ceil(log2(x)), and 31 - __clz(x) is equal to floor(log2(x)).
Whatever formula one chooses for integer log2(), it would be best to call __clz() directly; that way one gets the fastest implementation of __clz() on any CUDA-capable GPU.
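As a rough sketch of the two formulas above wrapped around a direct __clz() call: the helper names ilog2_floor and ilog2_ceil are made up for illustration and are not from the posted code. Both assume x is non-zero and a 32-bit unsigned int.

```cuda
// Sketch only: helper names are hypothetical, not part of CUDA or the posted code.
// Both functions assume x != 0. Note that __clz(0) returns 32.

// floor(log2(x)): index of the highest set bit.
__device__ __forceinline__ int ilog2_floor(unsigned int x)
{
    return 31 - __clz((int)x);
}

// ceil(log2(x)): number of bits needed to represent x - 1.
// For x == 1, x - 1 == 0 and __clz(0) == 32, giving 0 as expected.
__device__ __forceinline__ int ilog2_ceil(unsigned int x)
{
    return 32 - __clz((int)(x - 1));
}
```

For example, x = 5 would give ilog2_floor(5) = 31 - 29 = 2 and ilog2_ceil(5) = 32 - __clz(4) = 32 - 29 = 3. The zero case is left to the caller, since log2(0) is undefined.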
[Later:]
I wrote a little test program to verify that, for non-zero x, the following holds: