After checking the generated PTX file, I found that the address of an array element is calculated using a MUL instruction. The sequence looks as follows:
mov.u32 $r0, (&A); # load array base
mul.lo.u32 $r2, $r1, 4; # calculate the index for unsigned int array
mul.lo.u32 $r2, $r1, 16; # calculate the index for uint4 array
add.u32 $r3, $r2, $r0;
ld.global.v1.u32 [$r3+0], $r4; # load unsigned int
ld.global.v4.u32 [$r3+0], {$r4, $r5, $r6, $r7}; # load uint4
The programming guide states that MUL takes 8 cycles, compared to 2 cycles for regular integer operations. Is it possible to direct the compiler to generate SHL instead of MUL when calculating the index?
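To make the question concrete, here is a minimal sketch of computing the byte offset by hand with a shift instead of relying on A[i] indexing. The helper and kernel names (load_u32_shifted, load_uint4_shifted, copy_shifted) are made up for illustration and are not from any library. Whether the compiler actually emits SHL for this is an assumption to verify in the generated PTX; since PTX is an intermediate form, the final machine instruction may differ in either case.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: forms the element address as base + (i << 2).
// i << 2 equals i * sizeof(unsigned int) only while the result fits in
// 32 bits, i.e. for i < 2^30.
__host__ __device__ inline unsigned int
load_u32_shifted(const unsigned int* A, unsigned int i)
{
    const char* base = reinterpret_cast<const char*>(A);
    return *reinterpret_cast<const unsigned int*>(base + (i << 2));
}

// Same idea for uint4: 16 bytes per element, so shift by 4 (valid for i < 2^28).
// A must be 16-byte aligned, which cudaMalloc guarantees.
__host__ __device__ inline uint4
load_uint4_shifted(const uint4* A, unsigned int i)
{
    const char* base = reinterpret_cast<const char*>(A);
    return *reinterpret_cast<const uint4*>(base + (i << 4));
}

// Illustrative kernel: each thread copies one element using the shifted address.
__global__ void copy_shifted(const unsigned int* in, unsigned int* out,
                             unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = load_u32_shifted(in, i);
    }
}
```

Compiling this with nvcc -ptx and comparing the output against the sequence above would show whether the manual shift changes anything.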
- DB