Wmma f16 load always loads into 8 2xf16 registers

I was reading the CUDA documentation about wmma. It seems that wmma.load always loads eight 2xf16 registers if A or B is in f16 (see the PTX ISA 8.2 documentation).

Here is an example from the documentation:
wmma.load.a.sync.aligned.row.m16n16k16.f16 {x0,…,x7}, [p], s;

I was a bit confused. For this example, A has 256 elements, so each thread would load 8 elements. It would therefore only need four 2xf16 registers. Why would the instruction need eight 2xf16 registers?

  • Fragment size in bytes = 32 (eight elements of type .f16x2)

It appears that the loads for .f16 don’t specify vector variants for the different shapes. The .m32n8k16 and .m8n32k16 shapes will require 512 elements, not 256, across the warp, for A and B respectively. The compiler will tell you what is actually acceptable for a given instruction.

Thanks for the explanation. So it seems that this is because the instruction format applies to different shapes. Then a follow-up question is:
in the example of wmma.load.a.sync.aligned.row.m16n16k16.f16 {x0,…,x7}, [p], s,
as the instruction format requires eight 2xf16 registers (say, {x0,…,x7}), but it actually only needs four 2xf16 registers, will only {x0,…,x3} be used, and could {x4,…,x7} be used for other purposes?

My suggestion is to code up an example and see what the compiler actually tells you. My expectation is that the compiler will tell you it wants a length 4 vector for A or B in the m16n16k16 case, that it will tell you it wants a length 8 vector for the A case of m32n8k16 and a length 2 vector for the B case, and vice-versa for the m8n32k16 case.

I withdraw my comments. I concur that the operation seems to require storage for 512 elements warp-wide for the .m16n16k16 shape, for both A and B. I don’t have an explanation for what the extra elements are for. They seem to be necessary, however, as part of the unspecified fragment storage pattern.

Therefore, I make no comment about which exact elements are “used” and whether others could be used for some other purpose. Given that the fragment storage pattern is explicitly identified as “unspecified”, I think it is not possible to safely use the unused elements, and furthermore there is no statement as to which of the elements are unused. Even if you deduced it by inspection, there is no guarantee that the storage pattern would not change from one CUDA version to the next.

You can always ask for clarification of CUDA documentation by filing a bug.


# cat t41.cu
#include <cuda_fp16.h>
__global__ void k(half *a){

  unsigned out[8];
  asm volatile("wmma.load.a.sync.aligned.row.m16n16k16.f16 "
                     "{ %0, %1, %2, %3, %4, %5, %6, %7 }, [ %8 ];"
                     : "=r"(out[0]),"=r"(out[1]),"=r"(out[2]),"=r"(out[3]), "=r"(out[4]),"=r"(out[5]),"=r"(out[6]),"=r"(out[7])
                     : "l"(a));
}
# nvcc -c t41.cu -arch=sm_70
t41.cu(4): warning #550-D: variable "out" was set but never used
    unsigned out[8];

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"