Interesting… good to know.
Still, a PTX cheat sheet would be nice to have (e.g., for debugging).
For example: does mul.lo.u32 perform an unsigned multiply of the low 16 bits of each operand, producing a 32-bit result?
The snippet below generates an index into parm_output, an array of ints in global memory. The index is threadIdx.x + blockIdx.x * blockDim.x.
$r15 gets the product blockIdx.x * blockDim.x using mul.lo.u32, which makes sense under that reading because both operands fit in 16 bits (the grid dimensions are limited to 65535, and the block dimensions are smaller still).
$r16 gets the final index, adding threadIdx.x to the product above. This value could very well be larger than 2^16.
$r17 gets the offset into global memory, again using mul.lo.u32 to multiply the index ($r16) by sizeof(int) (4).
But since $r16 could be greater than 2^16, is mul.lo.u32 appropriate? Or do I misunderstand mul.lo.u32?
# parm_output[threadIdx.x + blockIdx.x * blockDim.x] = intval;
ld.param.u32 $r11, %parm_output; # id:30 %parm_output+0x0
cvt.u32.u16 $r12, %tid.x; #
cvt.u32.u16 $r13, %ctaid.x; #
cvt.u32.u16 $r14, %ntid.x; #
mul.lo.u32 $r15, $r13, $r14; # blockIdx.x * blockDim.x
add.u32 $r16, $r12, $r15; # threadIdx.x + (blockIdx.x * blockDim.x)
mul.lo.u32 $r17, $r16, 4; # ... * sizeof(int) ???
add.u32 $r18, $r11, $r17; #
st.global.s32 [$r18+0], $r10; # id:34
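
To make the two readings concrete, here is a small host-side sketch in C. The helper names (mul_low16_only, mul_lo_u32, byte_offset) are made up for illustration and are not part of any CUDA or PTX API. As far as I can tell from the PTX ISA manual, mul.lo is documented as a full-width multiply that keeps the low half of the double-width product, which would correspond to the second helper; treat that as my reading rather than a definitive answer.

```c
#include <stdint.h>

/* Reading A (the one asked about above): only the low 16 bits of each
   operand take part in the multiply. Hypothetical helper. */
static uint32_t mul_low16_only(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Reading B: full 32 x 32 multiply, keep the low 32 bits of the
   64-bit product. Hypothetical helper modelling mul.lo.u32 as I
   understand the PTX manual. */
static uint32_t mul_lo_u32(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a * (uint64_t)b;
    return (uint32_t)(wide & 0xFFFFFFFFu);
}

/* Mirrors the $r15..$r17 steps of the snippet under reading B. */
static uint32_t byte_offset(uint32_t tid, uint32_t ctaid, uint32_t ntid)
{
    uint32_t r15 = mul_lo_u32(ctaid, ntid); /* blockIdx.x * blockDim.x */
    uint32_t r16 = tid + r15;               /* + threadIdx.x           */
    uint32_t r17 = mul_lo_u32(r16, 4u);     /* * sizeof(int)           */
    return r17;
}
```

With threadIdx.x = 5, blockIdx.x = 300 and blockDim.x = 256, the index is 76805, which is larger than 2^16. Reading B gives a byte offset of 307220, while reading A would give 45076 for the last multiply and the store would land in the wrong place. So if reading A were correct, the generated code would be broken for any grid with more than 65536 threads.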