Hi,
I have a kernel that makes heavy use of mixed floating-point and integer computations,
as shown in the following disassembly:
mul24.hi.u32 $r16, $r4, $r8
cvt.rz.f32.u32 $r16, $r16
mul.rz.f32 $r16, $r16, $r1
mad24.lo.u32 $r17, $r4, $r8, $r3
cvt.rzi.u32.f32 $r4, $r16
mad24.lo.u32 $r4, -$r0, $r4, $r17
mad24.lo.u32 $r16, -$r6, $r11, $r4
mul24.hi.u32 $r4, $r6, $r11
cvt.rz.f32.u32 $r4, $r4
mul.rz.f32 $r4, $r4, $r1
cvt.rzi.u32.f32 $r4, $r4
mad24.lo.u32 $r4, $r4, $r0, $r16
cvt.rn.f32.u32 $r11, $r4
mad.rm.f32 $r11, $r11, $r2, $r1
mad24.lo.u32 $p1|$r4, -$r11, $r0, $r4
@$p1.sf add.u32 $r4, $r0, $r4
mul24.hi.u32 $r11, $r4, $r9
cvt.rn.f32.u32 $r11, $r11
mul.rn.f32 $r11, $r11, $r1
cvt.rzi.u32.f32 $r11, $r11
mul24.lo.u32 $r11, $r0, $r11
mad24.lo.u32 $p1|$r16, $r4, $r9, -$r11
mov.half.b32 $r4, $r0
AFAIK, the GPU supports ‘dual-issue’, i.e. the SM can execute instructions
in different functional units in parallel.
So, suppose I rearrange the code so that consecutive instructions have no
read-after-write dependencies: would it be possible to benefit from this feature,
or are floating-point and integer instructions executed on the same unit?
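To make the question concrete, here is a rough sketch of the kind of reordering I have in mind. It is plain C++ rather than the actual kernel, and the names (Result, chained, interleaved) and the arithmetic are made up for illustration only. The integer chain and the float chain do not depend on each other, so they can be interleaved without changing the result:

```cpp
#include <cstdint>

// Hypothetical result pair, only for this illustration.
struct Result {
    std::uint32_t i;  // end of the integer chain
    float f;          // end of the floating-point chain
};

// Order A: the whole integer chain first, then the whole float chain.
// Within each chain, every statement reads the result of the one before it.
Result chained(std::uint32_t a, std::uint32_t b, float x, float y) {
    std::uint32_t i = a * b;
    i = i * b + a;
    i = i * a + b;
    float f = x * y;
    f = f * y + x;
    f = f * x + y;
    return {i, f};
}

// Order B: the same two chains, interleaved so that neighbouring
// statements never read each other's results.
Result interleaved(std::uint32_t a, std::uint32_t b, float x, float y) {
    std::uint32_t i = a * b;
    float f = x * y;
    i = i * b + a;
    f = f * y + x;
    i = i * a + b;
    f = f * x + y;
    return {i, f};
}
```

Both orderings compute the same values. What I do not know is whether the hardware can actually overlap the integer and float instructions in the second ordering, and whether the compiler would keep that order in the first place.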
Thanks