Hi,
I’ve tried to use
__mul24(blockIdx.x,blockDim.x) + threadIdx.x;
but, in my case, it’s slower than
blockIdx.x * blockDim.x + threadIdx.x;
This is not a big deal, but I'm just trying to understand =)
Thanks,
Vince
And how do you measure this? =)
The profiler… I duplicated my function and replaced * with __mul24.
Maybe I’ve made a mistake…
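For a cross-check outside the profiler, here is a minimal sketch that times both index computations with CUDA events. The kernel names, the `timeKernel` helper, the problem size and the launch configuration are all made up for illustration; they are not Vince's actual code. Note that `__mul24` only uses the low 24 bits of each operand, so the sizes here are kept small enough for both versions to produce the same index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef void (*IndexKernel)(int *, int);

// Index computed with the ordinary 32-bit multiply.
__global__ void indexMul(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Same index computed with the 24-bit multiply intrinsic.
__global__ void indexMul24(int *out, int n)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: average time in milliseconds over `reps` launches.
float timeKernel(IndexKernel kernel, int *d_out, int n, int threads, int reps)
{
    int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(d_out, n);   // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        kernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1 << 20;                   // assumed problem size
    int *d_out = NULL;
    cudaMalloc(&d_out, n * sizeof(int));

    printf("*       : %f ms\n", timeKernel(indexMul,   d_out, n, 256, 100));
    printf("__mul24 : %f ms\n", timeKernel(indexMul24, d_out, n, 256, 100));

    cudaFree(d_out);
    return 0;
}
```

With kernels this small the difference may well be lost in launch overhead, so a real comparison should probably time the actual kernel with only the index line changed.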
Check the register usage in the cubin, or pass "--ptxas-options=-v" on the nvcc command line. In the past I've noticed that using __mul24 instead of * for blockIdx.x * blockDim.x + threadIdx.x increased the register usage of my kernels. The resulting drop in occupancy then hurt performance.
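Besides reading the ptxas output, the register count can also be queried at run time. The following is only a sketch: the two kernels and the `registersPerThread` helper are hypothetical stand-ins, and the numbers it prints depend on the compiler version and target architecture, so they may or may not differ on a given setup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Index computed with the ordinary 32-bit multiply.
__global__ void indexMul(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Same index computed with the 24-bit multiply intrinsic.
__global__ void indexMul24(int *out, int n)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: registers per thread of a compiled kernel, or -1 on error.
template <typename Kernel>
int registersPerThread(Kernel kernel)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess)
        return -1;
    return attr.numRegs;
}

int main()
{
    printf("*       : %d registers per thread\n", registersPerThread(indexMul));
    printf("__mul24 : %d registers per thread\n", registersPerThread(indexMul24));
    return 0;
}
```

If the __mul24 version reports more registers, fewer blocks may fit on each multiprocessor, which would match the occupancy explanation above.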