inline __device__ float4 hmma_fp32(const uint2& a, unsigned int b)
{
float4 c;
float zero = 0.f;
asm volatile("mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 \n"
" {%0, %1, %2, %3}, \n"
" {%4, %5}, \n"
" {%6}, \n"
" {%7, %7, %7, %7}; \n"
: "=f"(c.x), "=f"(c.y), "=f"(c.z), "=f"(c.w)
: "r"(a.x) "r"(a.y), "r"(b), "f"(zero));
return c;
}
The last line of the asm statement is missing a comma between "r"(a.x) and "r"(a.y).
Nevertheless, nvcc builds this code without complaint: Compiler Explorer
However, the CUDA documentation (1. Using Inline PTX Assembly in CUDA — inline-ptx-assembly 12.2 documentation) says:
“… where you can have multiple input or output operands separated by commas.”
So I think nvcc should emit an error here, just as clang does: Compiler Explorer
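For reference, here is a minimal sketch of the comma-separated operand form as I read it from the documentation. It is not part of my original code: `add_ptx`, `add_kernel`, and `run_add` are hypothetical names chosen for illustration, and the simple `add.s32` instruction stands in for the `mma.sync` call so that it runs on any GPU.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (illustration only): adds two ints via inline PTX.
__device__ int add_ptx(int x, int y)
{
    int sum;
    // One output operand after the first colon, then two input operands
    // after the second colon, separated by a comma.
    asm("add.s32 %0, %1, %2;" : "=r"(sum) : "r"(x), "r"(y));
    return sum;
}

// Hypothetical kernel that stores the result so the host can read it back.
__global__ void add_kernel(int x, int y, int* out)
{
    *out = add_ptx(x, y);
}

// Hypothetical host wrapper: launches one thread and returns its result.
// Error checking is omitted to keep the sketch short.
int run_add(int x, int y)
{
    int* d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    add_kernel<<<1, 1>>>(x, y, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Removing the comma between `"r"(x)` and `"r"(y)` in this sketch reproduces the same kind of operand list as in my function above, which is the form I would expect nvcc to reject.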