How does cuSPARSE control threads for SpMV?

Hello,

I’d like to execute sparse matrix-vector multiplication (SpMV) using CUDA in C++.

I’ve converted a 2D array (with 63% non-zero elements) to CSR format and used that format for the operation.

Specifically, I allocated one block per row, had each thread perform an element-wise multiplication, and used shared memory (per block) to add up the results.
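Here is a simplified sketch of that scheme (the actual code differs; the names, float32 data, and power-of-two block size are assumptions):

```cpp
// Sketch of the scheme above: one block per CSR row, each thread multiplies
// a strided slice of that row's nonzeros, and a shared-memory tree reduction
// produces the row sum. Launch with one block per row and
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void spmv_csr_block_per_row(const int*   row_ptr,
                                       const int*   col_idx,
                                       const float* vals,
                                       const float* x,
                                       float*       y)
{
    extern __shared__ float partial[];   // one slot per thread

    const int row   = blockIdx.x;        // one block per row
    const int start = row_ptr[row];
    const int end   = row_ptr[row + 1];  // empty row => start == end

    // Element-wise multiply: each thread accumulates a strided slice.
    float sum = 0.0f;
    for (int j = start + threadIdx.x; j < end; j += blockDim.x)
        sum += vals[j] * x[col_idx[j]];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Shared-memory tree reduction (blockDim.x assumed a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        y[row] = partial[0];
}
```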

This method looks quite efficient.

Now I’d like to know whether using a dedicated CUDA library (e.g., cuSPARSE) would be more efficient.

I don’t know how cuSPARSE implements SpMV specifically.

How does that library (I mean cuSPARSE) allocate threads (or blocks) to improve the efficiency of SpMV?

From a programmer’s point of view, what is the more efficient way to do SpMV?

Hello.

cuSPARSE SpMV operates as a black box; all you can do is peek, with NVIDIA’s profiling tool (Nsight Compute), at the kernels it uses underneath. There you will see that it uses two kernels: the first partitions the work, and the second does the heavy computational work, creating a number of threads proportional to the number of nonzeros of the matrix.
More specifically, if I recall correctly, CUDA 12.5 assigns 5 nonzeros per CUDA thread, and an earlier version (CUDA 11, maybe) assigned 4 nonzeros per thread.

So if you would like an efficient way to execute SpMV on GPUs with your own custom kernel, you should follow the same guideline and create a number of threads proportional to the number of nonzeros of the matrix, as sketched below. I do not know whether the thread block size plays any significant role in all this.
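As a rough illustration of that guideline (and emphatically not cuSPARSE’s actual kernel), a “nonzero split” SpMV might look like the following; each thread owns a fixed chunk of nonzeros, locates its starting row with a binary search over row_ptr, and combines per-row partial sums with atomicAdd. All names and the chunk size are illustrative, and y must be zeroed beforehand:

```cpp
#define NNZ_PER_THREAD 5   // cf. the ~5 nonzeros/thread seen in CUDA 12.5

__device__ int find_row(const int* row_ptr, int num_rows, int k)
{
    // Largest row r with row_ptr[r] <= k (binary search).
    int lo = 0, hi = num_rows - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (row_ptr[mid] <= k) lo = mid;
        else                   hi = mid - 1;
    }
    return lo;
}

__global__ void spmv_nnz_split(const int* row_ptr, const int* col_idx,
                               const float* vals, const float* x,
                               float* y, int num_rows, int nnz)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * NNZ_PER_THREAD;
    if (first >= nnz) return;
    int last = min(first + NNZ_PER_THREAD, nnz);

    int   row = find_row(row_ptr, num_rows, first);
    float sum = 0.0f;
    for (int k = first; k < last; ++k) {
        while (k >= row_ptr[row + 1]) {  // crossed a row boundary:
            atomicAdd(&y[row], sum);     // flush the finished row
            sum = 0.0f;
            ++row;
        }
        sum += vals[k] * x[col_idx[k]];
    }
    atomicAdd(&y[row], sum);             // flush the last row touched
}
```

One caveat: the atomicAdd step makes the order in which partial sums land in y nondeterministic from run to run.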


Thanks for your quick answer; I had missed an important point. However, because of an accuracy issue, I need the order of summation to be deterministic (not completely, but fairly well ordered). I’ll have to give this some thought…

I have some additional questions.

  1. What function does cuSPARSE use to sum up the per-row results when implementing SpMV with a CSR matrix?
  2. When using the CSR format, can a row with no non-zero elements be skipped (by a thread)? I think some threads may be wasted…

For example: A.T @ x = y (the result vector), where A is a (k x m) matrix, x is a (k x 1) vector, and y is a (m x 1) vector.

cf.) When I used atomicAdd (for the summation), the computational accuracy decreased. At the time, the matrix’s data type was float32.
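For reference, as far as I can tell the transposed product above maps onto cuSPARSE’s generic SpMV API roughly like this (a sketch built on the documented cusparseCreateCsr/cusparseSpMV entry points; error checking omitted and variable names illustrative):

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: y = A^T * x, with A a k x m float32 CSR matrix, so x has length k
// and y has length m. All pointers are device pointers.
void spmv_transposed(cusparseHandle_t handle, int k, int m, int nnz,
                     int* d_row_ptr, int* d_col_idx, float* d_vals,
                     float* d_x, float* d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;

    cusparseCreateCsr(&matA, k, m, nnz, d_row_ptr, d_col_idx, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, k, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, m, d_y, CUDA_R_32F);

    // Query the workspace size, then run the multiplication.
    size_t bufSize = 0;
    void*  dBuffer = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY,
                            CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuffer, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY,
                 CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

    cudaFree(dBuffer);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroySpMat(matA);
}
```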

A typical sparsity level (i.e. percentage of non-zero values) for cuSPARSE methods to be interesting might be 1% or less.

I don’t think cuSPARSE will be interesting from a performance perspective at 63% non-zero values.

If it were my code, and I already had an implementation of “sparse” matrix-vector multiply, and the non-zero percentage was 63%, I would probably compare its performance against cublas<T>gemv with a dense realization of the matrix, as sketched below.
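For reference, the dense comparison is essentially one call; a minimal sketch for float32 with column-major storage (error checking omitted):

```cpp
#include <cublas_v2.h>

// Sketch: y = A * x for a dense, column-major m x k float32 matrix A.
// d_A, d_x, d_y are device pointers; handle comes from cublasCreate().
void dense_gemv(cublasHandle_t handle, int m, int k,
                const float* d_A, const float* d_x, float* d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS assumes column-major storage; the leading dimension is m.
    cublasSgemv(handle, CUBLAS_OP_N, m, k,
                &alpha, d_A, m, d_x, 1, &beta, d_y, 1);
}
```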


Hi. I work on cuSPARSE. I agree with everything that’s been said here. The best SpMV for your matrix is almost surely going to be a dense GEMV from cuBLAS.

I don’t want to get into the internals of how cuSPARSE SpMV works; the implementation is complicated. If you’re interested in the topic in general, “spmv load balancing” is a good thing to search for on Google Scholar.


Thanks! A few months ago I used CuPy, and it consumed more than 40 GB of memory for twelve 2D arrays and a few operations! So I turned to SpMV to address the memory issue and the computation speed.
Based on your comments, it looks like a good way to address both issues (memory and speed), given the differences between CuPy and CUDA (especially in C++). I don’t yet know how CuPy and CUDA manage things internally when implementing matrix-vector multiplication, but it’s clear that I should try one of these libraries!

Thanks for your useful comment. I should try cuBLAS! Compared with CuPy, it looks like CUDA manages memory more efficiently (though I don’t know specifically how it is controlled).
Anyway, I appreciate your quick comment.