enforcing dual-issue by mixing fp and integer arithmetic


I have a kernel that extensively uses mixed floating-point and integer computations, as shown in the following disassembly:

mul24.hi.u32 $r16, $r4, $r8
cvt.rz.f32.u32 $r16, $r16
mul.rz.f32 $r16, $r16, $r1
mad24.lo.u32 $r17, $r4, $r8, $r3
cvt.rzi.u32.f32 $r4, $r16
mad24.lo.u32 $r4, -$r0, $r4, $r17
mad24.lo.u32 $r16, -$r6, $r11, $r4
mul24.hi.u32 $r4, $r6, $r11
cvt.rz.f32.u32 $r4, $r4
mul.rz.f32 $r4, $r4, $r1
cvt.rzi.u32.f32 $r4, $r4
mad24.lo.u32 $r4, $r4, $r0, $r16
cvt.rn.f32.u32 $r11, $r4
mad.rm.f32 $r11, $r11, $r2, $r1
mad24.lo.u32 $p1|$r4, -$r11, $r0, $r4
@$p1.sf add.u32 $r4, $r0, $r4
mul24.hi.u32 $r11, $r4, $r9
cvt.rn.f32.u32 $r11, $r11
mul.rn.f32 $r11, $r11, $r1
cvt.rzi.u32.f32 $r11, $r11
mul24.lo.u32 $r11, $r0, $r11
mad24.lo.u32 $p1|$r16, $r4, $r9, -$r11
mov.half.b32 $r4, $r0
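(For context: judging by the multiplies by $r1 followed by truncating conversions and a predicated corrective add, this sequence appears to compute an unsigned remainder by an invariant divisor via a floating-point reciprocal. A rough C sketch of that pattern, with hypothetical names; the register assignments are my guess, not taken from the disassembly:)

```c
/* Sketch of "unsigned mod via float reciprocal", as the disassembly suggests.
 * recip is a precomputed approximation of 1.0f / d (held in $r1 here). */
unsigned mod_via_float(unsigned x, unsigned d, float recip)
{
    unsigned q = (unsigned)((float)x * recip); /* cvt + mul + cvt.rzi */
    unsigned r = x - q * d;                    /* mad24.lo with negated operand */
    if ((int)r < 0)                            /* @$p1.sf add.u32: fix overshoot */
        r += d;
    return r;
}
```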

AFAIK, the GPU supports "dual-issue", i.e. the SM can execute instructions in different functional units in parallel.

So, suppose I rearrange the code such that there are no read-after-write dependencies: would it be possible to benefit from this feature, or are floating-point and integer instructions executed on the same unit?


Yes, the MAD unit executes both floating-point and integer instructions. The second unit, the SFU, can execute floating-point multiplications and moves.

However, when two instructions are "dual-issued", they always belong to different warps. Load-balancing occurs naturally as the warp scheduler selects an instruction from some warp and dispatches it to whichever unit is free.

There is still a benefit from reducing read-after-write dependencies when you have very low occupancy, but this is related to pipelining rather than dual-issue.
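To illustrate the pipelining point with a host-side analogy (plain C, hypothetical reduction example, not CUDA): splitting one serial chain into independent chains removes the read-after-write hazard between adjacent operations, so a pipelined unit can overlap them.

```c
#include <stddef.h>

/* One dependency chain: every add reads the result of the previous add,
 * so each one must wait out the full pipeline latency. */
float sum_chained(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Two independent chains: adjacent adds have no read-after-write
 * dependency, so their latencies can overlap in the pipeline. */
float sum_unrolled(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];
    return s0 + s1;
}
```

On a GPU with enough resident warps, the scheduler hides this latency for you, which is why the trick matters mainly at low occupancy.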

Focusing on reducing the number of instructions (and memory accesses) executed still has a much higher impact than trying to eliminate dependencies…

Thanks, I see. You're probably referring to double-precision when you talk about mul on the SFU…

Actually, I tried to "scramble" my code a bit to reduce read-after-write dependencies, but the compiler is aggressive about reducing register usage, so it tends to reuse freed registers as much as possible, undoing my reordering (this is what I observe in decuda).

Perhaps I need to play around with the 'registercount' parameter to get some effect.

I mean single-precision. The SFU can do transcendental ops, graphics attribute interpolation (not exposed in CUDA), register moves, and, anecdotally, a single-precision floating-point multiply.

(The latter has been mostly overemphasized because of a controversy over whether it should be counted toward the peak FLOP rate or not.)

I don't think it is worthwhile to do this… You'll probably spend more time fighting the compiler than it would have taken to code it directly in asm, and everything will need to be redone with each compiler upgrade.

Among all modern computer architectures, (current) GPUs are probably the least sensitive to read-after-write dependencies… So I'd say: don't worry too much about it.
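As a rough back-of-the-envelope (the numbers are illustrative, roughly matching the figures usually quoted for G80-class hardware, and are my assumption rather than anything stated above):

```python
# Approximate G80-era figures (assumptions for illustration only):
pipeline_latency = 24      # cycles before a dependent instruction can read a result
cycles_per_warp_instr = 4  # one warp instruction issues over 4 cycles on an 8-wide SM

# With this many warps resident, the scheduler always has an
# independent instruction to issue, so RAW latency is fully hidden:
warps_to_hide_latency = pipeline_latency // cycles_per_warp_instr
print(warps_to_hide_latency)
```

So only a handful of resident warps per SM already hides the arithmetic pipeline latency, regardless of dependencies inside any single warp.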

There are usually many other opportunities for optimizations. :)