Inline PTX and WMMA instructions

Some WMMA instructions take a sequence of 8 registers together as a single operand, like so:

wmma.load.a.sync.aligned.row.m16n16k16.global.f16       {%r26, %r27, %r28, %r29, %r30, %r31, %r32, %r33}, [%rd4], %r25;

Now, here is code for one such instruction. It clearly ignores that requirement, and it fails. The question is: how can I fix it?

void __hmma_m16n8k16_ld_a(
    int* __restrict fragment,
    const int* __restrict source_for_fragment,
    unsigned stride_in_elements_between_consecutive_rows)
{
    asm("wmma.load.a.sync.aligned.row.m16n8k16.f16 %0, [%1], %2;"
        :
        : "l"(fragment), "l"(source_for_fragment), "r"(stride_in_elements_between_consecutive_rows));
}

Notes:

  • A similar, though not identical, question is posted on Stack Overflow.
  • The choice of signature for that function is due to what’s common in crt/mma.hpp, don’t blame me…

There is no such instruction. See here; the supported shapes for wmma.load with .f16 operands are:

.shape = {.m16n16k16, .m8n32k16, .m32n8k16};
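
For one of the shapes that does exist (m16n16k16), the "8 values together" requirement can be met in inline PTX by binding eight 32-bit output operands and wrapping them in braces inside the instruction string, rather than passing a pointer to the fragment. The following is only a minimal sketch under assumptions: the names wmma_load_a_m16n16k16_row, load_a_kernel and run_wmma_load_a_demo are hypothetical, the source matrix is taken to be a 16x16 row-major f16 tile, and the code has to be built for sm_70 or newer. Because the mapping of matrix elements to fragment registers is unspecified, the demo fills the tile with 1.0 and only checks that every register holds two packed halves equal to 1.0.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical helper: load the A fragment (8 x .f16x2 registers per thread)
// of an m16n16k16 tile. The eight outputs are grouped with braces in the PTX.
__device__ inline void wmma_load_a_m16n16k16_row(
    unsigned (&frag)[8], const __half* src, unsigned stride_in_elements)
{
    asm volatile(
        "wmma.load.a.sync.aligned.row.m16n16k16.f16 "
        "{%0, %1, %2, %3, %4, %5, %6, %7}, [%8], %9;"
        : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3]),
          "=r"(frag[4]), "=r"(frag[5]), "=r"(frag[6]), "=r"(frag[7])
        : "l"(src), "r"(stride_in_elements)
        : "memory");
}

// One full warp must execute the instruction (.sync.aligned).
__global__ void load_a_kernel(const __half* src, unsigned* out)
{
    unsigned frag[8];
    wmma_load_a_m16n16k16_row(frag, src, 16u);
    for (int i = 0; i < 8; ++i)
        out[threadIdx.x * 8 + i] = frag[i];
}

// Host-side check: with an all-ones tile, every register must be 0x3C003C00
// (two packed f16 values of 1.0), whatever the fragment layout is.
bool run_wmma_load_a_demo()
{
    const int n = 16 * 16;
    __half h_src[n];
    for (int i = 0; i < n; ++i) h_src[i] = __float2half(1.0f);

    __half* d_src = nullptr;
    unsigned* d_out = nullptr;
    unsigned h_out[32 * 8] = {};
    bool ok = cudaMalloc(&d_src, sizeof(h_src)) == cudaSuccess
           && cudaMalloc(&d_out, sizeof(h_out)) == cudaSuccess
           && cudaMemcpy(d_src, h_src, sizeof(h_src), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        load_a_kernel<<<1, 32>>>(d_src, d_out);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < 32 * 8; ++i)
        ok = (h_out[i] == 0x3C003C00u);
    cudaFree(d_src);
    cudaFree(d_out);
    return ok;
}
```

The key difference from the code in the question is that the fragment is a set of register outputs ("=r"), not a memory destination passed with "l".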

I recommend switching to the mma instruction family if you need this shape (m16n8k16); I have given an answer here on SO about how to do that.

Hopefully it is self-evident that not all tensorcore functionality is exposed in the wmma family of instructions.
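
As a rough illustration of the mma route for m16n8k16, here is a minimal sketch, not the SO answer itself. It assumes f16 inputs and f16 accumulators, a build target of sm_80 or newer, and the names mma_m16n8k16_kernel and run_mma_m16n8k16_demo are hypothetical. With this variant each thread supplies four .f16x2 registers for A, two for B and two for C, and receives two for D. To stay independent of the per-thread fragment layout, the demo sets every element of A and B to 1.0 and C to 0, so every element of D should come out as 16.0.

```cuda
#include <cuda_runtime.h>

// Two packed f16 values of 1.0 in one 32-bit register.
#define PACKED_ONES 0x3C003C00u

// One full warp must execute the instruction (.sync.aligned).
__global__ void mma_m16n8k16_kernel(unsigned* out)
{
    unsigned a[4] = {PACKED_ONES, PACKED_ONES, PACKED_ONES, PACKED_ONES};
    unsigned b[2] = {PACKED_ONES, PACKED_ONES};
    unsigned c[2] = {0u, 0u};   // accumulator starts at zero
    unsigned d[2];

    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));

    out[threadIdx.x * 2 + 0] = d[0];
    out[threadIdx.x * 2 + 1] = d[1];
}

// Host-side check: each output register should hold two packed f16 values
// of 16.0 (0x4C00), since every D element is a sum of sixteen 1.0 * 1.0 terms.
bool run_mma_m16n8k16_demo()
{
    unsigned* d_out = nullptr;
    unsigned h_out[32 * 2] = {};
    bool ok = cudaMalloc(&d_out, sizeof(h_out)) == cudaSuccess;
    if (ok) {
        mma_m16n8k16_kernel<<<1, 32>>>(d_out);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < 32 * 2; ++i)
        ok = (h_out[i] == 0x4C004C00u);
    cudaFree(d_out);
    return ok;
}
```

Unlike wmma, mma has no matching load instruction in this sketch: the operands are ordinary registers, so real code has to fill them from memory according to the fragment layout documented for this shape.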