Address calculation in CUDA: how to calculate addresses efficiently?

After checking the generated PTX file, I see that the address of an array element is calculated using a MUL instruction. The sequence looks as follows:

mov.u32 $r0, (&A); # load array base

mul.lo.u32 $r2, $r1, 4; # calculate the index for unsigned int array
mul.lo.u32 $r2, $r1, 16; # calculate the index for uint4 array

add.u32 $r3, $r2, $r0;

ld.global.v1.u32 [$r3+0], $r4; # load unsigned int
ld.global.v4.u32 [$r3+0], {$r4, $r5, $r6, $r7}; # load uint4

The programming guide states that MUL takes 8 cycles, compared to 2 cycles for regular integer operations. Is it possible to direct the compiler to generate SHL instead of MUL to calculate the index?

  • DB

Try calculating the pointer to the first uint yourself using __mul24, then cast the result to a uint4 pointer. I played around with that but couldn’t measure a definitive speedup from it. I think most of the time the address calculation is negligible anyway relative to the latency of the ld.global. Maybe I just couldn’t find an application that is limited by address calculations :)
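A minimal sketch of that suggestion, for anyone who wants to try it. The names load_plain, load_mul24 and check_loads are made up for illustration, and the sketch assumes the element index fits in 24 bits, because __mul24 only uses the low 24 bits of each operand. It shows the hand-computed address next to plain indexing and checks that both load the same data. It says nothing about which one is faster.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Baseline: the compiler derives the address from the index.
__global__ void load_plain(const uint4 *A, uint4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = A[i];
}

// Hand-computed address: 24-bit multiply for the byte offset, then a cast.
// Assumption: i fits in 24 bits (see the note on __mul24 above).
__global__ void load_mul24(const uint4 *A, uint4 *out, int n)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n) {
        const char *base = reinterpret_cast<const char *>(A);
        const char *elem = base + __mul24(i, (int)sizeof(uint4));
        out[i] = *reinterpret_cast<const uint4 *>(elem);
    }
}

// Hypothetical host helper: runs both kernels and compares the results
// against the input. Returns false on any CUDA error or mismatch.
bool check_loads()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(uint4);
    std::vector<uint4> in(n), a(n), b(n);
    for (int i = 0; i < n; ++i)
        in[i] = make_uint4(i, i + 1, i + 2, i + 3);

    uint4 *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMemcpy(d_in, in.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        load_plain<<<(n + 255) / 256, 256>>>(d_in, d_a, n);
        load_mul24<<<(n + 255) / 256, 256>>>(d_in, d_b, n);
        // The blocking copies also surface any kernel launch failure.
        ok = cudaMemcpy(a.data(), d_a, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(b.data(), d_b, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);

    return ok && std::memcmp(a.data(), in.data(), bytes) == 0 &&
                 std::memcmp(b.data(), in.data(), bytes) == 0;
}
```

Call check_loads() from main to confirm the two variants agree. To see whether the hand-written version changes anything, time both kernels and compare the final machine code rather than the PTX.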

Peter

Note that some optimizations like this (converting multiplications by powers of 2 to shifts, and using 24-bit multiplies for addresses) are performed at a later stage in the driver, so you can’t always go by the PTX code.