Ampere single-bit MMA missing from Linux SDK (v11.6)?

Hello, I’m trying to use the single-bit 16x8x128 matrix multiply operation that is sold as a feature of Ampere, but the SDK seems to support only what is already available on Turing (8x8x128). Am I crazy? I’ve checked my -arch flag and the CUDA mma.h header, and I’ve even tried hand-writing the expected PTX, all to no avail. Everything points to the feature being missing from the compiler. What’s going on?

The documentation seems to be inconsistent. The overview table (9.7.13.1. Matrix Shape) in the PTX manual lists for Ampere:

Multiplicand type: .b1; Shape: .m8n8k128, .m16n8k128, and .m16n8k256

but the description of the instruction itself doesn’t show all those shapes:

Single-bit (.b1 multiplicands) wmma.mma:
wmma.mma.op.popc.sync.aligned.row.col.shape.s32.atype.btype.s32 d, a, b, c;
.shape = {.m8n8k128};
.atype = {.b1};
.btype = {.b1};
.op = {.xor, .and};
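For contrast, the single-bit shape the C++ API does expose is the Turing-level 8x8x128 one, through the experimental bmma_sync path in mma.h. A minimal sketch (the kernel name and pointer layout are mine; a and b point to bit-packed row-major/col-major tiles):

```cuda
#include <mma.h>
using namespace nvcuda::wmma;

// Sketch: 8x8x128 single-bit XOR+POPC matrix multiply via the
// experimental WMMA API (compile with -arch=sm_75 or newer).
__global__ void bmma_8x8x128(int *d, const unsigned *a, const unsigned *b) {
    fragment<matrix_a, 8, 8, 128, experimental::precision::b1, row_major> a_frag;
    fragment<matrix_b, 8, 8, 128, experimental::precision::b1, col_major> b_frag;
    fragment<accumulator, 8, 8, 128, int> c_frag;

    fill_fragment(c_frag, 0);
    load_matrix_sync(a_frag, a, 128);   // leading dimension in bits for .b1
    load_matrix_sync(b_frag, b, 128);
    // Lowers to the wmma.mma...popc PTX instruction shown above
    bmma_sync(c_frag, a_frag, b_frag, c_frag,
              experimental::bmmaBitOpXOR, experimental::bmmaAccumulateOpPOPC);
    store_matrix_sync(d, c_frag, 8, mem_row_major);
}
```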

I assume you are using the latest CUDA version (11.6 update 2)? Have you tried the Ampere-specific shape attributes advertised by the overview table, and .m16n8k128 in particular? Does using these attributes in inline PTX code generate compilation errors?
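To make that concrete, an inline-PTX probe for the advertised .m16n8k128 shape might look like the sketch below. The function name is mine, and the per-thread register counts are inferred from the shape (16x128 bits of A across 32 threads = two .b32 registers, one for B, four s32 for C/D); whether ptxas accepts it is exactly what’s in question here:

```cuda
// Hypothetical probe: try to emit the m16n8k128 .b1 mma listed in the
// overview table. If the toolkit lacks the shape, this should fail at
// compile time rather than at run time.
__device__ void bmma_16x8x128_probe(int d[4], const unsigned a[2],
                                    const unsigned b[1], const int c[4]) {
    asm("mma.sync.aligned.m16n8k128.row.col.s32.b1.b1.s32.xor.popc "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};"
        : "=r"(d[0]), "=r"(d[1]), "=r"(d[2]), "=r"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(b[0]),
          "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3]));
}
```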

The best course of action seems to be filing a bug with NVIDIA, so they can sort it out internally.

The bug report discussion was clarifying. At the PTX level there are two matrix multiplication instructions: wmma and mma. The former is older and more limited; the latter is the one that carries the newer Ampere shapes, but it isn’t surfaced in the C++ SDK.
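Spelled out at the PTX level (shapes taken from the manual’s tables), the two families look like this, which is why probing one doesn’t tell you anything about the other:

```
wmma.mma.xor.popc.sync.aligned.row.col.m8n8k128.s32.b1.b1.s32 d, a, b, c;  // older wmma family, what the SDK emits
mma.sync.aligned.m16n8k128.row.col.s32.b1.b1.s32.xor.popc     d, a, b, c;  // newer mma family, not surfaced by the SDK
```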

What makes this especially confusing is that the C++ SDK docs and headers use “MMA” and “WMMA” somewhat interchangeably, thus giving the false impression that the SDK can automatically output either PTX instruction.
