How to run an irregular GEMM on tensor cores?

Take the A100 as an example: WMMA operates on 16x16x16 matrix multiplications. But how can I accelerate a matrix multiplication of shape [1 * 256] * [256 * 256] while making full use of the tensor cores? Do I need to pad with 15 rows of zeros for every calculation?

That's not a matrix-matrix multiply shape that the tensor cores (TC) are designed for. I would call it a vector-matrix multiply, and TC will not give an interesting speed-up there.

There are no low-level TC mma operations that have a dimension of 1 (the smallest would be 8, for FP16). So you can pad to 8 rows if you wish, then issue a cuBLAS GEMM-style call to do the matrix-matrix multiply and use only 1 row of the result. I don't think that will be quicker than a GEMV-style call, but you could try it. The padding cuts the effective TC throughput (GFLOP/s) by at least a factor of 8, which brings it down into the range of what you can get with ordinary FP16 arithmetic on a GPU. For example, the A100 has 312 TF of non-sparse FP16 TC theoretical throughput, and its theoretical FP16 non-TC throughput is at least twice that of FP32, so ~40 TF.
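For concreteness, here is a minimal sketch of that padding approach (my own illustration, not from this thread; it assumes FP16 inputs with FP32 accumulation, uses dummy matrix contents, and omits most error checking):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int M = 8, K = 256, N = 256;   // M padded up from 1 to 8

    // Host buffers, column-major as cuBLAS expects.
    std::vector<__half> hA(M * K, __float2half(0.0f));   // padded "vector", rows 1..7 stay zero
    std::vector<__half> hB(K * N);
    for (int k = 0; k < K; ++k) hA[0 + k * M] = __float2half(1.0f);   // row 0 = the real data
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = __float2half(0.01f);

    __half *dA, *dB; float *dC;
    cudaMalloc(&dA, hA.size() * sizeof(__half));
    cudaMalloc(&dB, hB.size() * sizeof(__half));
    cudaMalloc(&dC, M * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // FP16 inputs with FP32 accumulation/output: a combination eligible for tensor cores.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 M, N, K,
                 &alpha,
                 dA, CUDA_R_16F, M,
                 dB, CUDA_R_16F, K,
                 &beta,
                 dC, CUDA_R_32F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    // Copy back only row 0 of the 8x256 result (column stride is M in column-major storage).
    std::vector<float> y(N);
    cudaMemcpy2D(y.data(), sizeof(float), dC, M * sizeof(float),
                 sizeof(float), N, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Only row 0 of the padded result is meaningful; the other 7 rows are computed and discarded, which is the factor-of-8 throughput loss mentioned above.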

When doing linear algebra, I don’t recommend writing your own CUDA code for it. Use an available library such as CUBLAS or CUTLASS.

Formally, cuBLAS gemv doesn't support the FP16 type, so I would suggest trying a GemmEx op.

If the shape you have suggested is actually the size of interest, and if you have many of them to do at the same time, CUTLASS offers a threadblock-level GEMV which might be quickest. Otherwise, the problem size you have is relatively small for a modern GPU.

Doing a lot of vector multiplications with the same matrix can be directly expressed mathematically as a matrix-matrix multiplication.
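As an illustration (my own sketch, not from the thread): stack the M vectors that all multiply the same K x N matrix into the rows of an M x K matrix, and a single matrix-matrix product computes all M vector-matrix products at once. The naive CPU loop below just stands in for a GPU library call such as cublasGemmEx:

```cpp
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cmath>

int main() {
    const int M = 128, K = 64, N = 64;            // M independent vectors of length K
    std::vector<float> A(M * K), B(K * N), C(M * N, 0.0f);
    for (auto &v : A) v = rand() / float(RAND_MAX);
    for (auto &v : B) v = rand() / float(RAND_MAX);

    // One GEMM: row m of C is exactly the vector-matrix product x_m * B,
    // where x_m is row m of A.  (Reference loop; on the GPU this becomes
    // a single library GEMM that can use the tensor cores.)
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                C[m * N + n] += A[m * K + k] * B[k * N + n];

    // Sanity check: row 0 of C equals the single vector-matrix product x_0 * B.
    float y0 = 0.0f;
    for (int k = 0; k < K; ++k) y0 += A[k] * B[k * N + 0];
    assert(std::fabs(C[0] - y0) < 1e-3f);
    return 0;
}
```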

Is it possible to use some linear algebra method to reshape the matrix? I've tried to figure it out but failed.

It’s not clear to me what you are trying to accomplish or what you mean by “reshape the matrix”.

I mean, if the original matrix multiplication is [1 * 64] * [64 * 64], can I use the results of [16 * 16] * [16 * 16] matrix multiplications to get the original result (maybe with some extra adds or subtracts, but no multiplies)?

I don't want to increase the number of multiply operations, but more add/subtract operations are fine.

No, you would need many vectors that are multiplied by the same 64x64 matrix; in your example you have just one.

Thanks.

Just mathematically:

Matrix-matrix multiplication can be seen as many vector-vector scalar products between a first and a second vector, each of size K.

With the special property that the first vector comes from a set of M vectors (each of size K)
and the second vector comes from a set of N vectors (each of size K).

The matrix-matrix multiplication multiplies all combinations and puts the results into a matrix of size MxN.
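In symbols (my notation, not from the post above): each entry of the result is the scalar product of row m of the first matrix with column n of the second,

$$C_{mn} = \sum_{k=0}^{K-1} A_{mk}\,B_{kn}, \qquad m = 0,\dots,M-1,\quad n = 0,\dots,N-1.$$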

As each vector is reused many times across those operations, this saves a lot of data transfers. But you only get the full performance advantage if you can reuse the vectors on both sides, e.g. having many datasets and many filters and wanting all combinations. Sometimes a problem can be restructured to look this way, e.g. polyphase filters.

One can also apply some tricks to make it work, e.g. by using a wider K than necessary and setting some elements of the coefficient vectors to 0 to select which elements to use. For example, one could implement a sliding filter (convolution) over some dataset. The sparsity feature of the tensor cores helps with those use cases.
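Here is a small sketch of that trick (my own illustration, not from the post): the coefficient matrix is widened to the full signal length and zero-padded so that each column selects one window of the input. The plain CPU loops only demonstrate the math; with many signals stacked as rows, the product becomes a genuine matrix-matrix multiply.

```cpp
#include <vector>
#include <cassert>
#include <cmath>

int main() {
    const int L = 32;                 // signal length (the widened K)
    const int F = 5;                  // filter length
    const int Nout = L - F + 1;       // number of output positions

    std::vector<float> x(L), h(F);
    for (int i = 0; i < L; ++i) x[i] = std::sin(0.3f * i);
    for (int j = 0; j < F; ++j) h[j] = 1.0f / F;           // moving-average filter

    // Coefficient matrix B (L x Nout, row-major): column n is the filter shifted
    // to offset n, everything else 0.  B is mostly zeros, which is where the
    // structured-sparsity feature can help.
    std::vector<float> B(L * Nout, 0.0f);
    for (int n = 0; n < Nout; ++n)
        for (int j = 0; j < F; ++j)
            B[(n + j) * Nout + n] = h[j];

    // y = x * B : one 1 x L times L x Nout product computes every output position.
    std::vector<float> y(Nout, 0.0f);
    for (int n = 0; n < Nout; ++n)
        for (int k = 0; k < L; ++k)
            y[n] += x[k] * B[k * Nout + n];

    // Check against the direct sliding-window filter.
    for (int n = 0; n < Nout; ++n) {
        float ref = 0.0f;
        for (int j = 0; j < F; ++j) ref += h[j] * x[n + j];
        assert(std::fabs(y[n] - ref) < 1e-4f);
    }
    return 0;
}
```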
