multi-threading with cuSPARSE lib

Hello,

I’m trying to calculate a multiplication between a sparse matrix and a dense matrix with cusparseScsrmm() on TX1.
Sparse matrix A is of size (64, 288) and dense matrix B is of size (288, 1000)
However I think the execution time is too long compared to cuDNN.
I thought maybe the operation isn’t parallelized. I’m not familiar with CUDA, but I came across the notion of a “stream”, which allows small tasks to be executed with overlap.
But I’m not sure how cusparseHandle_t and cudaStream_t relate to each other.
My idea was to split matrix B into 10 smaller matrices, 10 x (288, 100); then I could create 10 streams, each of which handles one multiplication.
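To make the plan concrete, here is a NumPy sketch of the column split I have in mind (just the math, not the actual CUDA/cuSPARSE code; sizes match my matrices above):

```python
import numpy as np

# Hypothetical stand-ins for my matrices: A is (64, 288), B is (288, 1000).
rng = np.random.default_rng(0)
A = rng.random((64, 288))
B = rng.random((288, 1000))

# Split B into 10 column blocks of shape (288, 100); the plan would be
# one cusparseScsrmm() call per block, each on its own CUDA stream.
blocks = np.hsplit(B, 10)
partials = [A @ blk for blk in blocks]

# Concatenating the partial products reproduces the full product A @ B.
C = np.hstack(partials)
assert C.shape == (64, 1000)
assert np.allclose(C, A @ B)
```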
My question is:
(1) Do I need to create 10 cusparseHandle_t so that I could assign one handle to each cudaStream?
(2) Since my sparse matrix A is created with one cusparseHandle, will A’s device pointer be shared with, or accessible from, the other 9 cusparseHandles (if I do need to create 10)?
(3) How to measure the execution time in this case?

Any help will be appreciated, thanks a lot guys ;)

Your suggestion is a bad idea and will not give you any speedup.

Hi txbob,
The original idea is to calculate one CNN convolution layer with cuSPARSE.
Input matrix B is of size 208x208x32, and sparse weight matrix A is of size 64x32x3x3.
In order to call the cuSPARSE library, I converted the weight matrix to a 2D matrix of size 64x288, and input matrix B to a 2D matrix of size 288x43264, so that I can call cusparseScsrmm(), which computes AxB.
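For clarity, here is the shape bookkeeping behind that conversion as a small Python check (I’m assuming 3x3 kernels with stride 1 and padding 1, so the 208x208 spatial size is preserved):

```python
# Shape bookkeeping for the im2col lowering described above.
# Assumed layer: 32 input channels, 64 filters of size 3x3,
# stride 1, zero-padding 1 (208x208 spatial size preserved).
C_in, K, C_out = 32, 3, 64
H, W = 208, 208

# Weight tensor (64, 32, 3, 3) flattens to a 2D matrix A of shape (64, 288).
A_rows, A_cols = C_out, C_in * K * K
assert (A_rows, A_cols) == (64, 288)

# im2col unrolls each 3x3x32 input patch into one column of B,
# giving B the shape (288, 208*208) = (288, 43264).
B_rows, B_cols = C_in * K * K, H * W
assert (B_rows, B_cols) == (288, 43264)

# The product A @ B then has shape (64, 43264), i.e. 64 output
# feature maps of 208x208 pixels each.
```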
But the calculation time is much longer than cuDNN on TX1.
What might be the problem? And how could I improve it?
I’m new to cuSPARSE, thanks a lot for your help

Is cusparseScsrmm() the right call to calculate convolution in this case?

CUDNN is likely to be dramatically faster than a naive usage of CUSPARSE, for computing convolutional layers, as you’ve already discovered.

That lowering of convolution to dense linear algebra is the crown jewel of CUDNN. In a nutshell, it converts a convolutional layer into a compact data set that can be processed with dense linear algebra (matrix-matrix multiply from CUBLAS, effectively). For a supported comparable operation, this will be significantly faster than anything you can come up with using CUSPARSE.
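To illustrate the idea (not cuDNN’s actual implementation), here is a minimal NumPy sketch of the im2col-plus-GEMM lowering for a single-channel “valid” convolution with toy sizes:

```python
import numpy as np

def conv2d_direct(x, w):
    """Naive single-channel 'valid' correlation, for reference."""
    H, W = x.shape
    k = w.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+k, j:j+k] * w)
    return out

def conv2d_im2col(x, w):
    """Same result, but lowered to one dense matrix product."""
    H, W = x.shape
    k = w.shape[0]
    # Each k x k patch becomes one column of B.
    cols = [x[i:i+k, j:j+k].ravel()
            for i in range(H - k + 1) for j in range(W - k + 1)]
    B = np.stack(cols, axis=1)          # (k*k, num_patches)
    A = w.ravel()[None, :]              # (1, k*k) flattened filter
    return (A @ B).reshape(H - k + 1, W - k + 1)

rng = np.random.default_rng(1)
x = rng.random((6, 6))
w = rng.random((3, 3))
assert np.allclose(conv2d_direct(x, w), conv2d_im2col(x, w))
```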

Writing fast neural network calculation code is non-trivial, and a full treatment couldn’t be covered in the space of a forum thread here. If you want to recreate the performance of CUDNN using CUSPARSE (or not using CUDNN) that is either difficult or effectively impossible.

I would suggest you use CUDNN. You’ve already identified the value of it, compared to your implementation.

In fact, I used CUDNN to calculate the convolution in dense format. Does it support convolution with sparse weight matrices?
If not, how am I supposed to calculate a convolution layer when my weight matrices are sparse?

I’m a bit confused here… I thought CUSPARSE is the only library that handles sparse calculations.
Because according to this post, CUDNN doesn’t interact with CUSPARSE…
https://devtalk.nvidia.com/default/topic/997345/implementing-cusparse-within-cudnn/

I’m confused too. You said this:

I haven’t studied your problem carefully, but based on the above comment I thought you had already realized and tested an implementation with CUDNN and found that it ran faster. If that is the case, my suggestion is to do that.

Your input and weight matrices don’t look that large to me. And I’m not really sure exactly what you mean by sparse matrix. Compared to (the weight matrix for) a FC/IP layer, (the weight matrix for) a Convolutional layer is “sparse” and that is the meaning I took. In that sense, CUDNN is probably going to be faster than anything you come up with.

If that’s not what you’re talking about, then feel free to ignore my comments.

If you’re talking about a convolutional layer where even the convolutional masks are mostly populated by zero values, for the size of problem that you seem to be talking about, I would still try to use CUDNN, acknowledging that you are going to be passing a weight matrix with a lot of zeroes.

For a problem (matrix size) in which the matrix could be represented either as a dense matrix (i.e. will fit in memory) or a sparse matrix, CUSPARSE is often not faster than CUBLAS in spite of the large amount of additional “work” that CUBLAS would do to provide a similar result. CUSPARSE becomes advantageous first and foremost when the equivalent dense matrix itself will not fit in memory, and so a sparse realization becomes more or less mandatory.

So CUDNN, whose crown jewel is to convert a convolutional layer into something that can be efficiently submitted to CUBLAS, may still be a win over any attempt to solve the problem using CUSPARSE.
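To put rough numbers on the overhead, here is a back-of-the-envelope storage comparison, assuming 4-byte float values and 4-byte column indices in CSR (an illustration of the trade-off, not a measured cuSPARSE figure):

```python
# Storage comparison: dense (m x n) floats vs. CSR of the same matrix.
# CSR keeps one float value plus one int column index per nonzero,
# plus a row-pointer array of m + 1 ints.
def dense_bytes(m, n, elem=4):
    return m * n * elem

def csr_bytes(m, n, nnz, elem=4, idx=4):
    return nnz * (elem + idx) + (m + 1) * idx

m, n = 64, 288
# At 60% nonzeros (the pruning level discussed in this thread), CSR is
# already *larger* than the dense matrix:
nnz = int(0.6 * m * n)
assert csr_bytes(m, n, nnz) > dense_bytes(m, n)

# Ignoring the row pointers, CSR storage only breaks even below about
# elem / (elem + idx) = 50% nonzeros, and the compute-side break-even
# on a GPU is far more demanding still.
nnz_sparse = int(0.05 * m * n)
assert csr_bytes(m, n, nnz_sparse) < dense_bytes(m, n)
```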

Thanks for your patience and detailed response. That means a lot to me, truly.

Just two more minutes, please.
In my network, I have an input matrix IN (208x208x32) and a weight matrix W (64x32x3x3); both are dense matrices stored in dense format, not pruned.
I implemented this convolution layer with CUDNN, and the execution time is (time_cudnn) ms for instance.

Then I pruned my network, resulting in a weight matrix with roughly 60% non-zero values. This is what I meant by a sparse weight matrix.
Then I thought I could convert this sparse weight matrix to CSR format and use the CUSPARSE library to calculate the convolution, which might be more efficient and faster. That’s where the idea came from.
I thought CUSPARSE, with the matrix stored in CSR format, would save a lot of computation on zero values. (If both were stored in dense format, then even if the weight matrix were all zeros, it would still be fully multiplied through, and thus consume the same amount of time.)
That’s also the reason why I didn’t try to feed the input and pruned weight matrix (stored in dense format) directly into CUDNN.
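The saving I had in mind looks like this toy CSR sparse-times-dense multiply in pure Python/NumPy (an illustration of the idea, obviously not what cuSPARSE does internally):

```python
import numpy as np

def dense_to_csr(W):
    """Convert a dense matrix to (values, col_indices, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matmul(values, col_idx, row_ptr, B):
    """C = W @ B, touching only the nonzeros of W."""
    m = len(row_ptr) - 1
    C = np.zeros((m, B.shape[1]))
    for i in range(m):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            C[i] += values[k] * B[col_idx[k]]   # one scaled row-add per nnz
    return C

rng = np.random.default_rng(2)
# Toy weight matrix with ~40% zeros, matching the pruning level above.
W = rng.random((8, 12)) * (rng.random((8, 12)) < 0.6)
B = rng.random((12, 5))
vals, cols, ptr = dense_to_csr(W)
assert np.allclose(csr_matmul(vals, cols, ptr, B), W @ B)
```

The inner loop runs once per nonzero, which is where I expected the saving to come from.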

So I used

cusparseSdense2csr()

to convert W to W_csr, the sparse weight matrix stored in CSR format.
Then I called

cusparseScsrmm()

(with some transformation of IN and W_csr) to calculate the convolution layer, but the execution time is much longer than time_cudnn. That’s why I’m wondering whether CUSPARSE is parallel enough…

So which point in my idea is wrong? And how may I accelerate the calculation of pruned convolution layer?

Thank you so much in advance.

using cusparse instead of cublas (or cudnn, which is essentially a front-end translator for cublas) will not help you if you have 40% zero values. cublas (or cudnn) will always be faster in that case.

It won’t be. CUSPARSE is a lot less efficient than CUBLAS both in terms of memory access and computational efficiency. 40% zeroes does not make up for this.

You need something like 99% zeroes before cusparse becomes a win over cublas for matrix multiply of the same matrix size, where either could be used. If you don’t believe me, code up a simple test yourself. It’s not difficult to do – easier than all the work you’ve already done to convert your weight matrix for use by CUSPARSE.
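Such a test can be sketched on the CPU with NumPy standing in for both libraries (a dense GEMM versus a COO-style multiply that touches only the nonzeros; absolute CPU timings prove nothing about the GPU, but the shape of the experiment is the same):

```python
import time
import numpy as np

rng = np.random.default_rng(3)
m, k, n = 64, 288, 1000

# Weight matrix with ~40% zeros, as in the pruned network above.
W = rng.random((m, k)) * (rng.random((m, k)) < 0.6)
B = rng.random((k, n))

# "cuBLAS" stand-in: plain dense GEMM.
t0 = time.perf_counter()
C_dense = W @ B
t_dense = time.perf_counter() - t0

# "cuSPARSE" stand-in: a COO-style multiply touching only nonzeros.
nz_r, nz_c = np.nonzero(W)
vals = W[nz_r, nz_c]
t0 = time.perf_counter()
C_sparse = np.zeros((m, n))
np.add.at(C_sparse, nz_r, vals[:, None] * B[nz_c])
t_sparse = time.perf_counter() - t0

# Both paths compute the same product; compare t_dense and t_sparse at
# different sparsity levels to find the break-even on your machine.
assert np.allclose(C_dense, C_sparse)
```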

Even then, CUSPARSE would typically be used to solve large sparse systems where the equivalent dense version of the matrix would not even fit in memory. That is the real win of CUSPARSE.

I’ve stated essentially the same message 3 or 4 times now. It’s OK if you don’t agree with me or don’t believe me. Carry on. But I can’t tell you how to use CUSPARSE to run faster than CUDNN.

I believe you; it’s just that I’m familiar with neither cuSPARSE nor cuDNN. It was just an idea, so I dug in.
Your suggestion is that I still use cuDNN with my pruned network stored in dense format, right?

Then I don’t understand why pruning could make the network faster… Since the CPU/GPU still needs to multiply and add zeros, this doesn’t save any CPU/GPU clocks…

Or maybe CUDNN/CUBLAS handles this situation?

my understanding of pruning is that it actually removes neurons, not just setting particular weights to zero:

https://jacobgil.github.io/deeplearning/pruning-deep-learning

in that context, pruning reduces the dimensions of the matrices. That is a win, whether you use cudnn, cublas, or cusparse. smaller matrices = faster

therefore pruning is not just setting particular weights to zero; it sets whole rows or columns of your matrix to zero, so those rows or columns can be deleted outright. That reduces matrix size, and therefore computational intensity, pretty much regardless of which method you use.
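That distinction can be checked in a few lines of NumPy (structured pruning zeroes whole rows of the weight matrix, which can then be deleted outright):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.random((6, 8))
B = rng.random((8, 5))

# Structured pruning: neurons 1 and 4 are removed, i.e. entire rows
# of W become zero, rather than scattered individual weights.
W[[1, 4], :] = 0.0

# The all-zero rows can be deleted, shrinking the GEMM from 6x8 to 4x8.
keep = ~np.all(W == 0.0, axis=1)
W_small = W[keep]
assert W_small.shape == (4, 8)

# The smaller product contains exactly the surviving rows of W @ B, so
# the speedup comes from reduced matrix size, not from sparse storage.
assert np.allclose(W_small @ B, (W @ B)[keep])
```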

Thank you for your great help; this helped me a lot, especially a better understanding of CUSPARSE.
This thing has troubled me for nearly a week…

Could you confirm that CUDNN/CUBLAS doesn’t handle sparse matrices? Meaning that with sparse matrices, those zeros will still consume CPU/GPU clocks.

I’ll do some digging into pruning; I might have to come back to you again xD

Like any other implementation of BLAS, CUBLAS is designed for use with dense matrices. At what degree of sparsity (percentage of zero elements) a switch to CUSPARSE is indicated, I couldn’t tell you; it will depend on the specific use case. When in doubt, experiment.

Thank you for your responses.
I tried cuDNN with an all-zero weight matrix, but it consumes the same amount of time.
I’m curious, so how does cuSPARSE work? Does it still need to convert CSR matrices to dense format and then calculate?

The source code to CUSPARSE is proprietary, so your guess is as good as mine as to how it works under the hood. My general understanding of sparse methods is that no intermediate conversion to dense storage formats takes place.

Yeah, that’s what I’m expecting as well. However, from what I’ve tested, CUSPARSE takes roughly 5x more time to calculate CSRxDENSE (40% zeros), and the measurement starts after im2col.