I withdraw my comments. I concur that the operation seems to require storage for 512 elements warp-wide for the .m16n16k16 shape, for both a and b. I don’t have an explanation for what the extra elements are for; they seem to be necessary, however, as part of the unspecified fragment storage pattern. Therefore I make no claim about which exact elements are “used” or whether the others could be repurposed. Given that the fragment storage pattern is explicitly identified as “unspecified”, I don’t think it is possible to safely use the unused elements, and there is no statement as to which of the elements are unused. Even if you deduced it by inspection, there is no guarantee that the storage pattern would not change from one CUDA version to the next.
You can always ask for clarification of CUDA documentation by filing a bug.
example:
# cat t41.cu
#include <cuda_fp16.h>
__global__ void k(half *a){
  unsigned out[8];
  asm volatile("wmma.load.a.sync.aligned.row.m16n16k16.f16 "
               "{ %0, %1, %2, %3, %4, %5, %6, %7 },"
               "[%8];\n"
               : "=r"(out[0]),"=r"(out[1]),"=r"(out[2]),"=r"(out[3]), "=r"(out[4]),"=r"(out[5]),"=r"(out[6]),"=r"(out[7])
               : "l"(a));
}
# nvcc -c t41.cu -arch=sm_70
t41.cu(4): warning #550-D: variable "out" was set but never used
unsigned out[8];
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
#
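As a contrast with the raw PTX above, the supported C++ WMMA API keeps the per-thread storage opaque, which is the safe way to work with fragments when the layout is unspecified. A sketch (the kernel and names are mine, not from this thread; assumes sm_70 or later):

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void k2(const half *a, float *out){
    // The fragment type owns the per-thread storage; the mapping from
    // x[i] to matrix elements is unspecified, so per-element access is
    // only meaningful through the opaque x[] / num_elements interface.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag;
    wmma::load_matrix_sync(frag, a, 16);  // 16 = leading dimension
    float sum = 0.0f;
    for (int i = 0; i < frag.num_elements; ++i)
        sum += __half2float(frag.x[i]);   // which matrix element this is, is unspecified
    out[threadIdx.x] = sum;
}
```

Nothing here depends on which elements are “real” and which are duplicates, which is exactly the property you lose when you poke at the raw registers.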