About sparse mma on A100

I’ve read the NVIDIA Ampere architecture whitepaper and found the mma.sp instruction on A100. I tried to use it for sparse convolution (by converting the convolution into a GEMM, e.g. via im2col or another method). But since mma.sp requires matrix A to have many zeros in chunks along each row, it is really hard to use for sparse convolution. Is there any way to solve this?

In my code, A is the input feats and B is the kernel. As for M, N, K: I use K to represent the different in_channels of the input feats (and of the kernel), N for the kernel’s out_feats, and M for the different point feats.

You want B to have the zeroes? Then just exchange A and B: mathematically, you transpose A and B beforehand, exchange A and B, and transpose D afterwards.

B^T x A^T = (A x B)^T = D^T
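To make the transpose trick concrete, here is a small sketch in plain Python (no CUDA, just the math): the product can be computed with A and B swapped and transposed, and transposing the result back gives the same D.

```python
# Illustrative sketch: verifies B^T x A^T = (A x B)^T = D^T, i.e. you can
# feed the operands in swapped, transposed order and transpose the result
# back afterwards.

def matmul(X, Y):
    """Naive dense matrix product of row-major nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

A = [[1, 2, 3],
     [4, 5, 6]]     # M x K = 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]      # K x N = 3 x 2

D = matmul(A, B)                                            # A x B
D_via_swap = transpose(matmul(transpose(B), transpose(A)))  # (B^T x A^T)^T

assert D == D_via_swap
print(D)   # [[58, 64], [139, 154]]
```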

Actually, I want A to have zeros in columns, or B to have zeros in rows, because that’s what im2col needs for sparse convolution.

But that is how sparse matrices normally work: A has zeros in columns.

Or in other words: for each row of A several columns can be zero.

But for convolution, the rows correspond to different in_channels of the input and the columns represent different points. If a point needs to be calculated, all of its in_channels need to be calculated.

K is the dimension where the products and the summation happen. One of your factors (input or coefficients) needs to be sparse in that dimension, so that you have half as many operations (leaving out the products with 0), but the same number of results.

Normally: M x N x K products, M x N results
Sparse processing: M x N x K/2 products, M x N results

One of your given matrices is a dense matrix (used as B); its size is K x N or N x K.

The other can have half sparsity in each K-sized component vector (used as A); its size is M x K or K x M.
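The half-sparsity requirement along K is the 2:4 ("structured") sparsity pattern: in every group of 4 consecutive K-elements of an A row, at most 2 may be non-zero. A minimal Python sketch (the pruning helper is illustrative, not a CUDA API):

```python
# Hypothetical sketch of 2:4 structured sparsity along K: keep the 2
# largest-magnitude values in each group of 4, zero the rest, so exactly
# half the products per K-vector can be skipped.

def prune_2_of_4(row):
    """Keep the 2 largest-magnitude values in each group of 4, zero the rest."""
    out = []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        keep = sorted(range(len(group)), key=lambda i: -abs(group[i]))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

row = [3.0, -1.0, 0.5, 4.0, 2.0, 2.5, -0.1, 0.2]   # one A row, K = 8
sparse_row = prune_2_of_4(row)

# Exactly 2 non-zeros survive per group of 4 -> half the products skipped.
assert all(sum(1 for v in sparse_row[g:g + 4] if v != 0.0) == 2
           for g in range(0, len(sparse_row), 4))
print(sparse_row)   # [3.0, 0.0, 0.0, 4.0, 2.0, 2.5, 0.0, 0.0]
```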

So there is no way to use mma.sp for the calculation of a convolution? I can’t see any possibility: if I need to have zeros in dimension K, there must be zeros in different in_channels.

I don’t understand how the matrix size can be K x M. Do you mean the layout in smem?

No, I mean the mathematical matrix size per MMA instruction.
Each single MMA instruction (not combining several) computes the matrix product A x B, and A and B each have a 2D size, as they are 2D matrices.

If you do a 1D convolution, then you have a set of 1D input vectors (one dimension is the convolution dimension = K; the other dimension of the 2D matrix is the multitude of independent input vectors). And you have a 1D convolution filter, which you have to repeat in a shifted, transposed fashion, see below.

[My other answer got lost seemingly, technical forum problem.]

Either

a) your input data or your filter naturally has many zeroes and is sparse.

b) Or, for a convolution kernel: if the convolution dimension of the overall problem is not >> K (K being the convolution size for a single mma instruction), in other words if the filter is relatively small, then you can use the sparse matrix operations to your advantage for the zeroes around the kernel:

(In the following, the K dimension is shown horizontally. Mathematically, B would have the K dimension vertically instead.)

Input Data matrix:

I0 I1 I2 I3 I4 ... IN
...
[other input data sets]

Filter matrix:

F0 F1 F2 0 ...
0 F0 F1 F2 0 ...
0 0 F0 F1 F2 0 ...
...

The convolution filter matrix has the same filter, moved by one element for each row (or column, if transposed).
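The shifted filter matrix above can be checked in a few lines of plain Python: a matrix-vector product with it computes all sliding dot products of the filter over the input (a "valid" correlation; flip the filter first for a true convolution).

```python
# Sketch: build the shifted filter matrix (each row holds the same filter
# moved one position to the right, zeros elsewhere) and verify that a
# matrix-vector product equals the direct sliding-window computation.

def filter_matrix(f, n):
    """Build the (n - len(f) + 1) x n matrix of shifted filter copies."""
    return [[0.0] * r + list(f) + [0.0] * (n - len(f) - r)
            for r in range(n - len(f) + 1)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

F = [1.0, 2.0, 3.0]                  # filter: F0 F1 F2
I = [1.0, 0.0, 2.0, 0.0, 1.0, 1.0]   # input:  I0 ... I5
M = filter_matrix(F, len(I))

# Direct sliding window for comparison.
slid = [sum(F[k] * I[r + k] for k in range(len(F)))
        for r in range(len(I) - len(F) + 1)]
assert matvec(M, I) == slid
print(matvec(M, I))   # [7.0, 4.0, 5.0, 5.0]
```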

The requirement for CUDA sparse matrices (depending on the data type) is that in each row, each group of 4 columns may have at most 2 non-zero elements (the zeros are not computed). As the 0 elements here appear as blocks on the left and right side, we have to reorder the K dimension. The reordering has to happen in the same way for both the input matrix and the filter matrix, even if only the filter matrix is sparse.

Example for reordering for K==16:
0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15

The first half appears as the even elements, the second half as the odd elements, each by itself in the original order.

Now think about what happens if the originally left elements are all zero, e.g. elements 0…9.
Then each even element (counting from 0) is zero.

If all the right elements are zero, e.g. elements 6…15, then all odd elements are zero.

If the filter is in the middle, e.g. elements 0…4 and 11…15 are zero, then after reordering, for some pairs of elements (the first pair has original indices 0 and 8, the second pair indices 1 and 9, …) the even element is zero, and for other pairs the odd element is zero.

So it is usable under the CUDA sparse matrix conditions.
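The reordering argument can be verified exhaustively in a short Python sketch, assuming K = 16 and a filter no longer than K/2 = 8 (here length 6): wherever the filter sits, every group of 4 reordered elements has at most 2 non-zeros.

```python
# Sketch checking the K == 16 interleaving reorder (0 8 1 9 ... 7 15):
# for every placement of a contiguous length-6 filter inside 16 zeros,
# each reordered group of 4 has at most 2 non-zeros, as the 2:4 sparse
# format requires.

K = 16
reorder = [i // 2 if i % 2 == 0 else i // 2 + 8 for i in range(K)]
assert reorder == [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

def ok_2_of_4(row):
    return all(sum(1 for v in row[g:g + 4] if v != 0) <= 2
               for g in range(0, len(row), 4))

for start in range(K - 6 + 1):
    row = [1 if start <= i < start + 6 else 0 for i in range(K)]
    reordered = [row[reorder[i]] for i in range(K)]
    assert ok_2_of_4(reordered), f"failed for filter at {start}"
print("all placements satisfy 2:4 after reordering")
```

This works because each reordered group of 4 contains original indices {i, i+1, i+8, i+9}, and a contiguous window of length <= 8 can never cover both i and i+8 (or both i+1 and i+9).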

c) If the convolution size is large compared to K, then you can probably also use sparse matrices to an advantage, via the overlapping of multiple mma instructions.

Thank you so much for such a detailed explanation of sparse convolution. But in fact, I’m not going to calculate the convolution with a sparse kernel; instead, the input is sparse. Let me take a 4 x 4 x 4 matrix as an example. What I really need is like below:
Matrix A (input matrix)

in_channel0 in_channel1 in_channel2 in_channel3
in_0_0 in_0_1 in_0_2 in_0_3
0 0 0 0
in_2_0 in_2_1 in_2_2 in_2_3
0 0 0 0

Matrix B (kernel matrix)

out_channel0 out_channel1 out_channel2 out_channel3
ker_0_0 ker_0_1 ker_0_2 ker_0_3
ker_1_0 ker_1_1 ker_1_2 ker_1_3
ker_2_0 ker_2_1 ker_2_2 ker_2_3
ker_3_0 ker_3_1 ker_3_2 ker_3_3

Each row of A represents a point; each row of B corresponds to an in_channel.
In sparse convolution, a kernel has very sparse points to calculate, but to fulfill the requirements of the tensor cores, a lot of padding is generated.

In your case, just leave out the empty rows of A. Each row of A goes into the corresponding row of the result D. The left-out result rows would be zero (or would just contain the value from C, which is added).

Just use dense matrices for the filled rows of A.

If you do not have enough points to compute in parallel: sorry, some computation is wasted; sparse won’t help.
Perhaps except with one small trick. But probably you have enough points.

What you do (at least as presented) does not look like a convolution, just a vector or matrix multiplication. But perhaps you do it after a Fourier transform (convolution theorem: convolution in space is multiplication in frequency space).

Thanks
