Address calculation in CUDA. How to calculate addresses efficiently?

darkbuck · April 12, 2007, 10:29am

After checking the generated PTX file, I found that the address of an array element is calculated using the MUL instruction. The sequence looks as follows:

mov.u32 $r0, (&A); # load array base

mul.lo.u32 $r2, $r1, 4; # calculate the index for unsigned int array
mul.lo.u32 $r2, $r1, 16; # calculate the index for uint4 array

add.u32 $r3, $r2, $r0;

ld.global.v1.u32 [$r3+0], $r4; # load unsigned int
ld.global.v4.u32 [$r3+0], {$r4, $r5, $r6, $r7}; # load uint4

The programming guide states that MUL takes 8 cycles, compared to 2 cycles for regular integer operations. Is it possible to direct the compiler to generate SHL instead of MUL when calculating the index?

DB

prkipfer · April 12, 2007, 12:51pm

Try calculating the pointer to the element's first uint yourself using __mul24, then cast the result to a uint4 pointer. I played around with that but couldn’t get a definitive speedup from it. I think most of the time this is negligible anyway relative to the latency of the ld.global. Maybe I just couldn’t find an application that is limited by address calculations :)
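A minimal sketch of that idea, next to a variant that writes the shift by hand, might look like the following. The names loadMul24, loadShift and checkAddressKernels are hypothetical and only for illustration. The code assumes the base pointer comes from cudaMalloc (so every 16-byte offset is suitably aligned for a uint4 load) and that the index fits in 24 bits. Whether either version ends up faster than the plain array index is not something this sketch shows.

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel: byte offset computed with the 24-bit multiply intrinsic,
// then the address is cast to a uint4 pointer for a single 16-byte load.
__global__ void loadMul24(const unsigned int* base, uint4* out, int n)
{
    int i = (int)(__umul24(blockIdx.x, blockDim.x) + threadIdx.x);
    if (i >= n) return;
    // Assumption: i fits in 24 bits, so __mul24 gives the exact product.
    int byteOffset = __mul24(i, (int)sizeof(uint4));
    const char* addr = (const char*)base + byteOffset;
    out[i] = *(const uint4*)addr;
}

// Hypothetical kernel: same load, with the multiply by 16 written as a shift.
__global__ void loadShift(const unsigned int* base, uint4* out, int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= (unsigned int)n) return;
    const char* addr = (const char*)base + ((size_t)i << 4);  // i * sizeof(uint4)
    out[i] = *(const uint4*)addr;
}

// Hypothetical host helper: runs both kernels on n uint4 elements and
// returns true if each one reproduces the input exactly.
bool checkAddressKernels(int n)
{
    size_t bytes = (size_t)n * sizeof(uint4);
    unsigned int* hIn = (unsigned int*)malloc(bytes);
    uint4* hOut = (uint4*)malloc(bytes);
    for (int k = 0; k < 4 * n; ++k) hIn[k] = (unsigned int)k;

    unsigned int* dIn = 0;
    uint4* dOut = 0;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    bool ok = true;
    for (int variant = 0; variant < 2; ++variant) {
        cudaMemset(dOut, 0xFF, bytes);  // make stale results detectable
        if (variant == 0) loadMul24<<<blocks, threads>>>(dIn, dOut, n);
        else              loadShift<<<blocks, threads>>>(dIn, dOut, n);
        cudaError_t err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
        ok = ok && (err == cudaSuccess) && (memcmp(hIn, hOut, bytes) == 0);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return ok;
}
```

Calling checkAddressKernels(1024) from a host main only confirms that both address computations load the same data. Any timing comparison would have to be measured separately.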

Peter

Simon_Green · April 12, 2007, 1:02pm

Note that some optimizations like this (converting MULs by powers of 2 to shifts, and using 24-bit muls for addresses) are performed at a later stage in the driver, so you can’t always go by the PTX code.