Does __mul24 slow down my algorithm?


I’ve tried to use

__mul24(blockIdx.x,blockDim.x) + threadIdx.x;

but, in my case, it’s slower than

blockIdx.x * blockDim.x + threadIdx.x;

This is not a big deal, but I’m just trying to understand =)



And how do you measure this? =)

With the profiler… I duplicated my function and replaced * with __mul24.
Maybe I’ve made a mistake…

Check the register usage in the cubin, or pass “--ptxas-options=-v” on the nvcc command line. In the past I’ve noticed that using __mul24 instead of * for blockIdx.x * blockDim.x + threadIdx.x increased the register usage of my kernels. The resulting drop in occupancy then hurt performance.
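To make that comparison concrete, here is a minimal sketch of the kind of test described above. The kernel names (scale_mul, scale_mul24), the file name, and the problem size are made up for illustration and are not the original poster's code. It reports the register count of each variant via cudaFuncGetAttributes and times each with CUDA events, so the two versions can be compared on your own GPU.

```cuda
// mul24_test.cu (hypothetical file name)
// Build: nvcc --ptxas-options=-v mul24_test.cu -o mul24_test
// The -v option makes ptxas print the registers used by each kernel.
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: global index computed with the ordinary multiply.
__global__ void scale_mul(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Same kernel, but the index uses the 24-bit multiply intrinsic.
__global__ void scale_mul24(float *data, int n)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Time a single launch of one kernel with CUDA events (milliseconds).
template <typename Kernel>
float time_kernel(Kernel k, float *d, int n, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    k<<<blocks, threads>>>(d, n);      // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    k<<<blocks, threads>>>(d, n);      // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;             // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Registers per thread, as seen by the runtime.
    cudaFuncAttributes a1, a2;
    cudaFuncGetAttributes(&a1, scale_mul);
    cudaFuncGetAttributes(&a2, scale_mul24);
    printf("operator* : %d registers\n", a1.numRegs);
    printf("__mul24   : %d registers\n", a2.numRegs);

    printf("operator* : %f ms\n", time_kernel(scale_mul, d, n, blocks, threads));
    printf("__mul24   : %f ms\n", time_kernel(scale_mul24, d, n, blocks, threads));

    cudaFree(d);
    return 0;
}
```

With a kernel this small the register counts may well come out identical. The difference, if any, is more likely to show up in a real kernel that is already close to a register limit, so it is worth running the same check on the actual function.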