Direct access to Volta HMMA instruction

alexanderguzhva · December 14, 2017, 6:30pm

Hi,
is there a known way to access the SASS HMMA instruction through ptxas without using builtins (such as the __hmma_XXX ones)? The reason is that 4x4 matrix multiplication would yield a huge performance bonus for my code, while 16x16 operations (in the way they are implemented) are truly worthless. At least, I was unable to find any suitable PTX instruction, so the PTX assembler keeps saying ‘Not a name of any known instruction’ :(

On the other hand, nvdisasm does know this instruction:
/*03d0*/ HMMA.884.F32.F32.STEPx R10, R16.reuse.COL, R2.reuse.ROW, R10;
where x=0,1,2,3, which is exactly what I need.

Thanks.

tera · December 14, 2017, 8:48pm

PTX ISA :: CUDA Toolkit Documentation

cbuchner1 · December 14, 2017, 9:39pm

It’s just a pity that they only expose 16x16 matrix operations in their public API.

If you need to work with 4x4, 8x4 or 4x8 matrices you’re sort of lost. We do a lot of radio simulation at my employer and MIMO antenna systems with 4 or 8 antennas are very common. What a waste of tensor capacity to cram it into a 16x16 matrix and pad it with zeros ;)

Also to implement more than just multiplications (for example matrix inversions which are very important in MIMO receiver system modeling) I’d need row operations (weighted multiply+add of one matrix row to another row) mapped to the tensor cores.
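As a point of reference, the row operation described above is small when written as scalar code. The sketch below is illustrative only: `row_axpy` and `make_elementary4` are hypothetical names, not part of any NVIDIA API, and the matrix is assumed to be row-major `float`. The second helper builds the elementary matrix E = I + w * e_dst * e_src^T; left-multiplying a 4x4 matrix by E performs the same row operation, which is one way such a step could be phrased as a matrix product.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical helper: row[dst] += w * row[src] on a row-major n x n matrix.
__host__ __device__ inline void row_axpy(float* m, int n, int dst, int src, float w)
{
    for (int j = 0; j < n; ++j)
        m[n * dst + j] = fmaf(w, m[n * src + j], m[n * dst + j]);
}

// Hypothetical helper: build the 4x4 elementary matrix E such that E * M
// equals M after the row operation row[dst] += w * row[src].
__host__ __device__ inline void make_elementary4(float e[16], int dst, int src, float w)
{
    for (int i = 0; i < 16; ++i)
        e[i] = (i / 4 == i % 4) ? 1.0f : 0.0f;  // start from the identity
    e[4 * dst + src] += w;                      // add the single weight entry
}
```

Whether this form is a good fit for tensor cores is a separate question, since E is almost entirely zeros and ones.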

I’d welcome it if some genius wizard could figure out and document the SASS instructions to allow for more flexible tensor core programming.

Christian

alexanderguzhva · December 15, 2017, 7:01pm

PTX ISA 8.3

I need to do a straightforward 4x4 multiplication within a single thread. All the data is stored in registers. Using shared memory to combine data as 16x16 matrices, then multiplying using artificial wmma high-level API, and then getting data back… well, it just kills all the performance.
What is necessary is to open access to four HMMA-instructions that do the job.

Feeling disappointed, considering shipping my Volta cards back 8(

njuffa · December 15, 2017, 7:17pm

If you think the part doesn’t work for your use case, why not? I am sure there are others waiting to snap it up.

I agree that NVIDIA’s failure to commit to (and support) an ISA with backward compatibility is a pain in the behind for ninja programmers. Because there is no binary compatibility at the hardware level between architectures, everything needs to be routed through PTX as the virtual ISA, which means not all hardware features get exposed directly, presumably for fear on the part of PTX maintainers that the next hardware generation will remove or significantly modify underlying hardware support (which has happened in the past, e.g. Kepler’s SIMD video instructions).

On the flip-side, the lack of a binary compatible ISA ensures a maximum pace of innovation and performance growth, which helps the average CUDA user.

You could certainly try filing an RFE (enhancement request) with NVIDIA to get the HMMA instruction exposed in a way that better fits your needs. An RFE can be filed through the bug reporting form; simply prefix the subject line with RFE to mark it as an enhancement request rather than a functional bug.

scottgray · December 15, 2017, 7:52pm

Keep in mind that when working with matrix outer products that are this small, you are more than likely in the memory-bound regime. The size of the outer products determines the potential for data reuse and hence the level of compute intensity available. Tensorcores assume an enormous amount of compute intensity to be able to run at peak efficiency. That being said, having matmul primitives on the hardware can at least simplify the code quite a bit and there could be some value there.
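To put rough numbers on the data-reuse argument, here is a back-of-the-envelope sketch. It assumes each tile is read from and written to memory exactly once with no reuse across tiles, which is a simplification; `tile_flops_per_byte` is a hypothetical helper, not a library function.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: FLOPs per byte for one (m x k) * (k x n) tile product,
// counting every operand element once and assuming no reuse across tiles.
__host__ __device__ inline float tile_flops_per_byte(int m, int n, int k,
                                                     int in_bytes, int out_bytes)
{
    const float flops = 2.0f * m * n * k;                   // multiply + add per term
    const float bytes = (float)in_bytes * (m * k + k * n)   // A and B read once
                      + (float)out_bytes * (m * n);         // C written once
    return flops / bytes;
}
```

With FP16 inputs (2 bytes) and an FP32 result (4 bytes), this model gives 1 FLOP per byte for a 4x4x4 product and 4 FLOPs per byte for 16x16x16. If the operands already live in registers, the memory term does not apply in this form.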

Anyway, adding a few multiply-add instructions instead of doing the op in one shot isn’t so onerous and it’s worth benchmarking to see if you’re anywhere close to peak single precision performance. If not, a tensorcore isn’t going to magically fix that for you.
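A minimal sketch of such an FMA baseline, assuming row-major 4x4 `float` tiles held in per-thread arrays. The names `mat4_fma` and `mat4_fma_kernel` and the one-tile-per-thread layout are assumptions for illustration, not code from this thread.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical helper: C = A * B + C for row-major 4x4 tiles.
// With the loops fully unrolled the compiler can keep the 48 values in
// registers and emit one fused multiply-add per (i, j, k) term.
__host__ __device__ inline void mat4_fma(const float a[16], const float b[16], float c[16])
{
#pragma unroll
    for (int i = 0; i < 4; ++i) {
#pragma unroll
        for (int j = 0; j < 4; ++j) {
            float acc = c[4 * i + j];
#pragma unroll
            for (int k = 0; k < 4; ++k)
                acc = fmaf(a[4 * i + k], b[4 * k + j], acc);
            c[4 * i + j] = acc;
        }
    }
}

// Hypothetical kernel: thread t multiplies the t-th pair of 4x4 tiles.
__global__ void mat4_fma_kernel(const float* a, const float* b, float* c, int n)
{
    const int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;

    float ra[16], rb[16], rc[16];
    for (int i = 0; i < 16; ++i) {
        ra[i] = a[16 * t + i];
        rb[i] = b[16 * t + i];
        rc[i] = 0.0f;
    }
    mat4_fma(ra, rb, rc);
    for (int i = 0; i < 16; ++i)
        c[16 * t + i] = rc[i];
}
```

A 4x4x4 product is 64 fused multiply-adds per thread. Timing the kernel against the FP32 peak of the card would show how much headroom is left before a tensor core path could help.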

If you’re looking for some example code that implements small tile matrix multiplication you can check out the blocksparse primitives I released recently:

github.com/openai/blocksparse/blob/master/src/blocksparse_matmul_op_gpu.cu

#if GOOGLE_CUDA

// #include <stdio.h>
#include "ew_op_gpu.h"
#include <stdio.h>

template <bool Fprop, typename TW, typename TX, typename TY>
__global__ void __launch_bounds__(32) gemm_blocksparse_08x64x08x8_xprop(
    const  int2* __restrict__ Lut,
    const    TW* __restrict__ W,
    const    TX* __restrict__ X,
    TY* Y, int* Lock, int locks, int N /* N is in units of groups of 8 elements each (N/8) */)
{
    if (Fprop)
        asm(".shared .align 16 .b32 share[576];" ::); // 576 =  8*8 + 64*8
    else
        asm(".shared .align 16 .b32 share[608];" ::); // 608 = 12*8 + 64*8

This file has been truncated.

Looking at Nvidia’s CUTLASS might be another good source of inspiration.

alexanderguzhva · December 16, 2017, 2:07am

That’s a great idea! Thanks, I will.

Technically speaking, it is not a few MAD instructions, but dozens of them. Also, using 16x16 is extremely redundant for me as it requires additional registers, additional synchronization and shared memory.

njuffa · December 16, 2017, 3:02am

Nonetheless, Scott Gray’s suggestion of creating prototype code (and then profiling it) to see how far you can get with just single-precision FMAs has merit.

cbuchner1 · December 16, 2017, 9:20pm

Maybe even consider exploiting the 2x speedup obtainable with the half2 math arithmetic functions, all documented here:

http://docs.nvidia.com/cuda/cuda-math-api/index.html

For example, __hfma2 is a fused multiply-add that can accelerate matrix operations if the source data is already available in half precision. The main drawback is that it cannot accumulate results in an FP32 register, if this feature is important to you.
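A sketch of how `__hfma2` might be applied to a 4x4 product, assuming B and C are packed two columns per `__half2` and the code is compiled for compute capability 5.3 or higher (for example `-arch=sm_70` on Volta). `mat4_hfma2`, the kernel, and the host wrapper are hypothetical names. The accumulation stays in FP16 throughout, which is the drawback mentioned above.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical helper: C = A * B for one 4x4 tile, entirely in FP16.
// A is row-major scalars; each row of B and C is two __half2 values
// (columns 0-1 and columns 2-3).
__device__ void mat4_hfma2(const __half a[16], const __half2 b[8], __half2 c[8])
{
    for (int i = 0; i < 4; ++i) {
        __half2 lo = __float2half2_rn(0.0f);  // C[i][0..1]
        __half2 hi = __float2half2_rn(0.0f);  // C[i][2..3]
        for (int k = 0; k < 4; ++k) {
            const __half2 aik = __half2half2(a[4 * i + k]);  // broadcast A[i][k]
            lo = __hfma2(aik, b[2 * k + 0], lo);
            hi = __hfma2(aik, b[2 * k + 1], hi);
        }
        c[2 * i + 0] = lo;
        c[2 * i + 1] = hi;
    }
}

// Hypothetical kernel: a single thread converts FP32 inputs to FP16,
// multiplies one tile, and converts the result back to FP32.
__global__ void mat4_hfma2_kernel(const float* a, const float* b, float* c)
{
    __half ha[16];
    __half2 hb[8], hc[8];
    for (int i = 0; i < 16; ++i) ha[i] = __float2half(a[i]);
    for (int i = 0; i < 8; ++i) hb[i] = __floats2half2_rn(b[2 * i], b[2 * i + 1]);
    mat4_hfma2(ha, hb, hc);
    for (int i = 0; i < 8; ++i) {
        c[2 * i + 0] = __low2float(hc[i]);
        c[2 * i + 1] = __high2float(hc[i]);
    }
}

// Hypothetical host wrapper: returns true if every CUDA call succeeded.
bool mat4_hfma2_host(const float a[16], const float b[16], float c[16])
{
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    const size_t bytes = 16 * sizeof(float);
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess
           && cudaMalloc(&db, bytes) == cudaSuccess
           && cudaMalloc(&dc, bytes) == cudaSuccess
           && cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        mat4_hfma2_kernel<<<1, 1>>>(da, db, dc);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return ok;
}
```

The inner loops issue 32 `__hfma2` operations where an FP32 version would need 64 scalar FMAs, which is where the potential 2x comes from.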

alexanderguzhva · December 19, 2017, 8:02pm

The compute part takes 20x longer than the data loading part, verified. Also, the data might be scattered in memory, while wmma requires input and output data to be in contiguous memory blocks.
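For reference, the wmma path being discussed looks roughly like this for a single 4x4 problem. This is a sketch under assumptions: one warp, FP16 inputs with FP32 accumulation, the 16x16x16 shape, compilation with `-arch=sm_70` or newer, and the hypothetical names `mat4_via_wmma` and `mat4_via_wmma_host`. It shows the staging through contiguous shared memory and the zero padding that the thread objects to.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: one warp computes a 4x4 product by padding it
// into 16x16 tiles, since wmma loads need contiguous tiles in memory.
__global__ void mat4_via_wmma(const float* a, const float* b, float* c)
{
    __shared__ __align__(32) half  sa[16 * 16];
    __shared__ __align__(32) half  sb[16 * 16];
    __shared__ __align__(32) float sc[16 * 16];

    // Zero-pad: only the top-left 4x4 corner of each tile carries data.
    for (int i = threadIdx.x; i < 256; i += 32) {
        const int r = i / 16, col = i % 16;
        const bool inside = r < 4 && col < 4;
        sa[i] = __float2half(inside ? a[4 * r + col] : 0.0f);
        sb[i] = __float2half(inside ? b[4 * r + col] : 0.0f);
    }
    __syncwarp();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, sa, 16);
    wmma::load_matrix_sync(fb, sb, 16);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(sc, fc, 16, wmma::mem_row_major);
    __syncwarp();

    // Copy the useful 4x4 corner back out.
    if (threadIdx.x < 16)
        c[threadIdx.x] = sc[16 * (threadIdx.x / 4) + threadIdx.x % 4];
}

// Hypothetical host wrapper: returns true if every CUDA call succeeded.
bool mat4_via_wmma_host(const float a[16], const float b[16], float c[16])
{
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    const size_t bytes = 16 * sizeof(float);
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess
           && cudaMalloc(&db, bytes) == cudaSuccess
           && cudaMalloc(&dc, bytes) == cudaSuccess
           && cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        mat4_via_wmma<<<1, 32>>>(da, db, dc);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return ok;
}
```

In this arrangement only 64 of the 4096 multiply-adds in the 16x16x16 operation contribute to the 4x4 result, and the operands make a round trip through shared memory on the way in and out.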
