Direct access to Volta HMMA instruction

Hi,
is there a known way to access the SASS HMMA instruction through PTXAS without using builtins (such as the __hmma_XXX ones)? The reason is that a 4x4 matrix multiplication would yield a huge performance bonus for my code, while the 16x16 operations (in the way they are implemented) are truly worthless to me. At least, I was unable to find any suitable PTX instruction, so the PTX assembler keeps saying ‘Not a name of any known instruction’ :(

On the other hand, nvdisasm does know this instruction:
/*03d0*/ HMMA.884.F32.F32.STEPx R10, R16.reuse.COL, R2.reuse.ROW, R10;
where x=0,1,2,3, which is exactly what I need.

Thanks.

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions

It’s just a pity that they only expose 16x16 matrix operations in their public API.

If you need to work with 4x4, 8x4 or 4x8 matrices you’re sort of lost. We do a lot of radio simulation at my employer and MIMO antenna systems with 4 or 8 antennas are very common. What a waste of tensor capacity to cram it into a 16x16 matrix and pad it with zeros ;)

Also, to implement more than just multiplications (for example matrix inversions, which are very important in MIMO receiver system modeling), I’d need row operations (weighted multiply+add of one matrix row to another row) mapped to the tensor cores.

I’d welcome it if some genius wizard could figure out and document the SASS instructions to allow for more flexible tensor core programming.

Christian

I need to do a straightforward 4x4 multiplication within a single thread. All the data is stored in registers. Using shared memory to combine the data into 16x16 matrices, then multiplying with the artificial high-level wmma API, and then getting the data back… well, it just kills all the performance.
What is necessary is to open up access to the four HMMA instructions that do the job.
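
For reference, the round trip described above looks roughly like the sketch below. It is only an illustration under stated assumptions: the kernel name wmma_padded_4x4 and its interface are made up for this example, it must be launched with exactly one warp (32 threads), and it needs to be compiled for sm_70 or newer. The 4x4 operands are zero-padded into 16x16 tiles in shared memory, multiplied with the public WMMA API, and the top-left 4x4 block is copied back.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: computes C4 = A4 * B4 for 4x4 row-major matrices by
// padding them into one 16x16x16 WMMA tile. Launch as <<<1, 32>>>.
__global__ void wmma_padded_4x4(const half* A4, const half* B4, float* C4)
{
    // load_matrix_sync / store_matrix_sync need 256-bit aligned pointers.
    __shared__ __align__(32) half  a_tile[16 * 16];
    __shared__ __align__(32) half  b_tile[16 * 16];
    __shared__ __align__(32) float c_tile[16 * 16];

    // Copy the 4x4 inputs into the top-left corner and zero everything else.
    for (int e = threadIdx.x; e < 16 * 16; e += 32) {
        const int  row    = e / 16;
        const int  col    = e % 16;
        const bool inside = (row < 4) && (col < 4);
        a_tile[e] = inside ? A4[4 * row + col] : __float2half(0.0f);
        b_tile[e] = inside ? B4[4 * row + col] : __float2half(0.0f);
    }
    __syncwarp();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a_tile, 16);
    wmma::load_matrix_sync(b_frag, b_tile, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c_tile, c_frag, 16, wmma::mem_row_major);
    __syncwarp();

    // Only 16 of the 256 computed results are useful.
    if (threadIdx.x < 16) {
        const int row = threadIdx.x / 4;
        const int col = threadIdx.x % 4;
        C4[threadIdx.x] = c_tile[16 * row + col];
    }
}
```

Only 4x4x4 = 64 of the 16x16x16 = 4096 multiply-adds in the tile contribute to the result, on top of the shared memory traffic and the warp-level synchronization.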

Feeling disappointed, considering shipping my Volta cards back 8(

If you think the part doesn’t work for your use case, why not? I am sure there are others waiting to snap it up.

I agree that NVIDIA’s failure to commit to (and support) an ISA with backward compatibility is a pain in the behind for ninja programmers. Because there is no binary compatibility at the hardware level between architectures, everything needs to be routed through PTX as the virtual ISA, which means not all hardware features get exposed directly, presumably for fear on the part of PTX maintainers that the next hardware generation will remove or significantly modify underlying hardware support (which has happened in the past, e.g. Kepler’s SIMD video instructions).

On the flip-side, the lack of a binary compatible ISA ensures a maximum pace of innovation and performance growth, which helps the average CUDA user.

You could certainly try and file an RFE (enhancement request) with NVIDIA to get the HMMA instruction exposed in a way that better fits your needs. RFEs can be filed through the bug reporting form; simply prefix the subject line with RFE to mark it as an enhancement request rather than a functional bug.

Keep in mind that when working with matrix outer products that are this small, you are more than likely in the memory bound regime. The size of the outer products determines the potential for data reuse and hence the level of compute intensity available. Tensorcores assume an enormous amount of compute intensity to be able to run at peak efficiency. That being said, having matmul primitives on the hardware can at least simplify the code quite a bit and there could be some value there.

Anyway, adding a few multiply-add instructions instead of doing the op in one shot isn’t so onerous and it’s worth benchmarking to see if you’re anywhere close to peak single precision performance. If not, a tensorcore isn’t going to magically fix that for you.

If you’re looking for some example code that implements small tile matrix multiplication you can check out the blocksparse primitives I released recently:

Looking at Nvidia’s CUTLASS might be another good source of inspiration.

That’s a great idea! Thanks, I will.

Technically speaking, it is not a few MAD instructions, but dozens of them. Also, using 16x16 is extremely redundant for me as it requires additional registers, additional synchronization and shared memory.

Nonetheless, Scott Gray’s suggestion of creating prototype code (and then profiling it) to see how far you can get with just single-precision FMAs has merit.
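
A minimal sketch of such a prototype, assuming 4x4 row-major matrices and one matrix product per thread. The names mma4x4 and batched_mma4x4 and the memory layout are assumptions for this example and do not come from this thread. With constant loop bounds and full unrolling the compiler can normally keep all three matrices in registers, and each product comes out as 64 FMAs:

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical helper: C += A * B for 4x4 row-major matrices held in
// per-thread arrays. Usable on host and device so it is easy to check.
__host__ __device__ inline void mma4x4(const float (&a)[16],
                                       const float (&b)[16],
                                       float (&c)[16])
{
#pragma unroll
    for (int i = 0; i < 4; ++i) {
#pragma unroll
        for (int j = 0; j < 4; ++j) {
            float acc = c[4 * i + j];
#pragma unroll
            for (int k = 0; k < 4; ++k)
                acc = fmaf(a[4 * i + k], b[4 * k + j], acc);  // one FMA
            c[4 * i + j] = acc;
        }
    }
}

// Hypothetical benchmark kernel: thread t handles the t-th matrix triple.
// A, B and C each hold n consecutive 4x4 row-major matrices.
__global__ void batched_mma4x4(const float* A, const float* B, float* C, int n)
{
    const int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;

    float a[16], b[16], c[16];
#pragma unroll
    for (int e = 0; e < 16; ++e) {
        a[e] = A[16 * t + e];
        b[e] = B[16 * t + e];
        c[e] = C[16 * t + e];
    }

    mma4x4(a, b, c);

#pragma unroll
    for (int e = 0; e < 16; ++e)
        C[16 * t + e] = c[e];
}
```

Timing batched_mma4x4 for a large n and comparing the achieved FLOP rate against the card's FP32 peak shows how much headroom is left before tensor cores could help at all.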

Maybe even consider exploiting the 2x speedup obtainable with the half2 math arithmetic functions, all documented here:

http://docs.nvidia.com/cuda/cuda-math-api/index.html

For example, __hfma2 is a fused multiply-add that can accelerate matrix operations if the source data is already available in half precision. The main drawback is that it cannot accumulate results in an FP32 register, if this feature is important to you.
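
As a rough illustration, a 4x4 product could use __hfma2 along the following lines. This is a sketch, not a tuned implementation: the names mma4x4_half2, mma4x4_half2_kernel and run_mma4x4_half2 and the storage layout (each row of B and C as two __half2 values) are assumptions for this example. It has to be compiled for compute capability 5.3 or newer, e.g. with -arch=sm_70 for Volta.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical helper: C += A * B for 4x4 row-major matrices in FP16.
// Rows of B and C are stored as two __half2 values (columns 0-1 and 2-3),
// so every __hfma2 performs two multiply-adds at once.
__device__ inline void mma4x4_half2(const __half (&a)[16],
                                    const __half2 (&b)[8],
                                    __half2 (&c)[8])
{
#pragma unroll
    for (int i = 0; i < 4; ++i) {
        __half2 lo = c[2 * i];      // C[i][0..1]
        __half2 hi = c[2 * i + 1];  // C[i][2..3]
#pragma unroll
        for (int k = 0; k < 4; ++k) {
            const __half2 aik = __half2half2(a[4 * i + k]);  // broadcast A[i][k]
            lo = __hfma2(aik, b[2 * k], lo);
            hi = __hfma2(aik, b[2 * k + 1], hi);
        }
        c[2 * i]     = lo;
        c[2 * i + 1] = hi;
    }
}

// Demo kernel (assumed interface): one thread converts float inputs to half,
// runs the product, and writes the FP16 result back as float.
__global__ void mma4x4_half2_kernel(const float* A, const float* B, float* C)
{
    __half  a[16];
    __half2 b[8], c[8];
    for (int e = 0; e < 16; ++e) a[e] = __float2half(A[e]);
    for (int p = 0; p < 8; ++p) {
        b[p] = __floats2half2_rn(B[2 * p], B[2 * p + 1]);
        c[p] = __floats2half2_rn(C[2 * p], C[2 * p + 1]);
    }
    mma4x4_half2(a, b, c);
    for (int p = 0; p < 8; ++p) {
        C[2 * p]     = __low2float(c[p]);
        C[2 * p + 1] = __high2float(c[p]);
    }
}

// Host wrapper (hypothetical): A, B, C point to 16 floats each; C is updated.
inline cudaError_t run_mma4x4_half2(const float* A, const float* B, float* C)
{
    float* d = nullptr;
    cudaError_t err = cudaMalloc(&d, 48 * sizeof(float));
    if (err != cudaSuccess) return err;
    cudaMemcpy(d,      A, 16 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d + 16, B, 16 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d + 32, C, 16 * sizeof(float), cudaMemcpyHostToDevice);
    mma4x4_half2_kernel<<<1, 1>>>(d, d + 16, d + 32);
    err = cudaMemcpy(C, d + 32, 16 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err;
}
```

The product takes 32 __hfma2 operations instead of 64 scalar FMAs, but the accumulators lo and hi stay in FP16, which is exactly the limitation mentioned above.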

The compute part takes 20x longer than the data loading part; I verified that. Also, the data might be scattered in memory, while wmma requires the input and output data to be in contiguous memory blocks.