Has anyone benchmarked the VMAD.U16.U16[.SHR15] operation on Maxwell?

I’m looking at the SHR15 variant and see that it does compile to a single SASS op but I’m wondering about the throughput.

Otherwise, I’ll write a benchmark. :)

As plain VMAD.U16.U16 it’s full throughput; it’s just an alias for XMAD. But with the SHR15 flag it drops down to half throughput. In general, if an instruction involves a shift in some way then it’s half throughput. The VMAD.U8.U8 instruction is also half throughput (but I’m looking forward to sm_61 finally having decent native low-precision support).

Ah, thanks!

Half throughput is a bummer but it’s still super fast. :)

I was trying to figure out whether a MAD.WIDE.U16 + SHR.U32(15) pairing and repack was any better than a VMAD.SHR_15 pairing and repack:

```
XMAD R3, R0.reuse, R1.reuse, R2;
XMAD R4, R0.H1, R1.H1, R2;
SHR.U32 R3, R3, 0xf;
SHR.U32 R4, R4, 0xf;
XMAD.PSL.CLO R3, R4, 0x1, R3;
```

vs.

```
VMAD.U16.U16.SHR_15 R3, R0.reuse, R1.reuse, R2;
VMAD.U16.U16.SHR_15 R4, R0.H1, R1.H1, R2;
XMAD.PSL.CLO R3, R4, 0x1, R3;
```

It sounds like they’re going to be similar.

I’ll have to try it.

I count 7 clocks for the first and 5 for the second, not counting any stalls needed to satisfy dependencies. Presumably your occupancy is high enough to cover those stalls. Though come to think of it, they may run at about the same speed, as the scheduler will see the pipe busy after a shift op and possibly schedule an XMAD from another warp. But fewer instructions is generally better for instruction cache performance.

I use inline PTX vmads all the time when I know my a/b operands fit in 16 bits. Last I checked, the compiler just assumes all XMAD operands are 32 bits and generates the maximum number of instructions needed (usually 3).
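To make the “usually 3” concrete, here is a minimal host-side C++ sketch of the arithmetic, not of what the compiler actually emits. A full 32-bit product (mod 2^32) needs three 16x16 multiplies, while operands known to fit in 16 bits need only one. The helper names (`lo16`, `hi16`, `mul32_via_16`, `mad16`) are made up for illustration.

```cpp
#include <cstdint>

// Low and high 16-bit halves of a 32-bit word.
constexpr uint32_t lo16(uint32_t x) { return x & 0xffffu; }
constexpr uint32_t hi16(uint32_t x) { return x >> 16; }

// General case: a 32-bit product (mod 2^32) built from three 16x16 multiplies.
// The hi*hi term would be shifted left by 32, so it never contributes.
constexpr uint32_t mul32_via_16(uint32_t a, uint32_t b) {
    return lo16(a) * lo16(b)
         + ((lo16(a) * hi16(b) + hi16(a) * lo16(b)) << 16);
}

// Special case (assumption: a and b fit in 16 bits): both cross terms are
// zero, so a single 16x16 multiply-add covers the whole operation.
constexpr uint32_t mad16(uint32_t a, uint32_t b, uint32_t c) {
    return lo16(a) * lo16(b) + c;
}
```

If the operands really are 16-bit, `mad16` and a full 32-bit multiply-add agree, which is the case where a single hand-written vmad can stand in for the longer compiler sequence.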

Btw, I still wonder why NVIDIA can’t bring back good ol’ mad24. It’s single-cycle (as part of fp32 mad) and has so many uses, in particular for index calculation.

Sorry I goofed… you also get pipeline mixing within the same warp. So the second shift only consumes 1 clock in each. So 6 vs 4. Then assuming perfect mixing from other warps the throughput is simply the number of instructions. Unless all your other warps look the same and you have a bunch of these in sequence. Then the second one is limited by the shift throughput and will max out at 4 clocks.
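For anyone following the counts, here is a toy tally in host-side C++ of the counting rule as I read it from the two posts above. It is only bookkeeping, not a model of the hardware. Assumptions: XMAD-like ops cost 1 issue clock, shift-like ops (SHR, VMAD with a shift flag) cost 2, and in the corrected version a shift-like op costs only 1 when the next op in the same warp goes to the other pipe. `Pipe`, `naive_clocks` and `mixed_clocks` are hypothetical names.

```cpp
// FULL: full-throughput op (e.g. XMAD). HALF: half-throughput, shift-like op.
enum Pipe { FULL, HALF };

// The two sequences under discussion, reduced to the pipe each op uses.
constexpr Pipe seq_xmad_shr[] = { FULL, FULL, HALF, HALF, FULL };  // XMAD, XMAD, SHR, SHR, XMAD
constexpr Pipe seq_vmad[]     = { HALF, HALF, FULL };              // VMAD, VMAD, XMAD

// First estimate: every half-throughput op costs 2 clocks.
constexpr int naive_clocks(const Pipe* ops, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += (ops[i] == HALF) ? 2 : 1;
    return total;
}

// Corrected estimate: a half-throughput op directly followed by a
// full-throughput op overlaps with it and is charged only 1 clock.
constexpr int mixed_clocks(const Pipe* ops, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        const bool next_is_full = (i + 1 < n) && (ops[i + 1] == FULL);
        total += (ops[i] == HALF && !next_is_full) ? 2 : 1;
    }
    return total;
}
```

With these rules the first estimate gives 7 vs 5 and the corrected one gives 6 vs 4, matching the numbers quoted in the thread.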

Thanks for the clock counts.

Using the VMAD instruction, it looks like you can cram two “unit interval” ([0.0, 1.0]) numbers into 32 bits by representing each with the integer range [0, 32768] and exploiting the SHR_15.

I suppose that’s one of the things the op was originally designed to do. :)
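As a rough illustration of that encoding, here is a host-side C++ sketch of the arithmetic only (no PTX). It assumes 32768 stands for 1.0, the low lane sits in bits 0..15 and the high lane in bits 16..31, and the shift truncates. All names (`ONE`, `pack2`, `mul_unit`, `mul_unit2`) are hypothetical.

```cpp
#include <cstdint>

// Unit interval in fixed point: 0 maps to 0.0 and 32768 (1 << 15) maps to 1.0.
constexpr uint32_t ONE = 1u << 15;

// Two lanes per 32-bit word: lo in bits 0..15, hi in bits 16..31.
constexpr uint32_t pack2(uint32_t lo, uint32_t hi) { return (hi << 16) | (lo & 0xffffu); }
constexpr uint32_t lane_lo(uint32_t v) { return v & 0xffffu; }
constexpr uint32_t lane_hi(uint32_t v) { return v >> 16; }

// One lane of multiply: the product of two values in [0, 32768] is at most
// 2^30, and shifting right by 15 rescales it back into [0, 32768].
constexpr uint32_t mul_unit(uint32_t a, uint32_t b) { return (a * b) >> 15; }

// Packed multiply: one multiply-and-shift per lane, then a repack.
constexpr uint32_t mul_unit2(uint32_t a, uint32_t b) {
    return pack2(mul_unit(lane_lo(a), lane_lo(b)),
                 mul_unit(lane_hi(a), lane_hi(b)));
}
```

Using [0, 32768] rather than [0, 32767] means 1.0 * 1.0 comes back as exactly 1.0, and 32768 (0x8000) still fits in a 16-bit lane.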

A MAD operation looks like this:

```
DEVICE_STATIC_INTRINSIC_QUALIFIERS
q16v2
mad_q16v2(union q16v2 a, union q16v2 b, union q16v2 c)
{
  u32 d, e;
  // low lane: (a.lo * b.lo + (c.lo << 15)) >> 15
  asm("vmad.u32.u32.u32.shr15 %0, %1.h0, %2.h0, %3;" : "=r"(d) : "r"(a.lohi), "r"(b.lohi), "r"(32768 * c.lo));
  // high lane: (a.hi * b.hi + (c.hi << 15)) >> 15
  asm("vmad.u32.u32.u32.shr15 %0, %1.h1, %2.h1, %3;" : "=r"(e) : "r"(a.lohi), "r"(b.lohi), "r"(32768 * c.hi));
  q16v2 r;
  r.lo = d;
  r.hi = e;
  return r;
}
```

And compiles to 6 instructions:

```
LOP32I.AND R3, R2, 0xffff;
BFE.U32 R4, R2, 0x1010;
SHL R3, R3, 0xf;
SHL R6, R4, 0xf;
VMAD.U16.U16.SHR_15 R3, R0.reuse, R1.reuse, R3;
VMAD.U16.U16.SHR_15 R4, R0.H1, R1.H1, R6;
```

I’m still permuting solutions in my head and think there are even better solutions for this given that I’m representing the unit interval with fixed point and can make additional assumptions.

For example, NVCC does a good job with vanilla C and uses plain XMADs and LEAs.

*Later…*

In the code above, shifting the addend to the left doesn’t gain you anything since this is a fixed point calculation. It’s best to implement a MAD as a MUL + ADD and, if your use case allows, you can cheat and use a scalar ADD.U32.
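A small host-side C++ sketch of why the pre-shifted addend buys nothing: since `c << 15` is a multiple of 2^15, `(a*b + (c << 15)) >> 15` equals `((a*b) >> 15) + c` as long as nothing overflows 32 bits. It also shows the scalar-add “cheat” on packed lanes. The function names are hypothetical, and the packed add assumes each lane’s sum still fits in 16 bits.

```cpp
#include <cstdint>

// Addend folded into the shifted multiply-add: it is pre-scaled by 2^15
// so that the final >> 15 brings it back to its original scale.
constexpr uint32_t mad_shifted(uint32_t a, uint32_t b, uint32_t c) {
    return (a * b + (c << 15)) >> 15;
}

// MUL then ADD: same result for a, b, c in [0, 32768], no pre-shift needed.
constexpr uint32_t mul_then_add(uint32_t a, uint32_t b, uint32_t c) {
    return ((a * b) >> 15) + c;
}

// The "cheat": add both packed 16-bit lanes with one ordinary 32-bit add.
// Assumption: the low-lane sum fits in 16 bits; otherwise it carries into
// the high lane and corrupts it.
constexpr uint32_t add_packed(uint32_t x, uint32_t y) { return x + y; }
```

Whether the single 32-bit add is safe depends entirely on the value ranges in your use case. For example, 32768 + 32768 in the low lane already carries into the high lane.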

A few hours of experimentation shows that VMAD.SHR15 is really fast and, at least in my use case, much faster than the C equivalent.

I wrote several packed MAD and MUL operations and the fastest always used VMAD.SHR15.