How does cuSPARSE control threads for SpMV?

Hello,

I’d like to execute sparse matrix-vector multiplication (SpMV) using CUDA in C++.

I’ve converted a 2D array (with 63% non-zero elements) to CSR format and used that format for the operation.

Specifically, I allocated one block per row, had each thread perform an element-wise multiplication, and used shared memory (per block) to add up the results.
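Here is a simplified sketch of that scheme (the actual code differs; the names, float32 data, and power-of-two block size are assumptions):

```cpp
// Sketch of the scheme above: one block per CSR row, each thread multiplies
// a strided slice of that row's nonzeros, and a shared-memory tree reduction
// produces the row sum. Launch with one block per row and
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void spmv_csr_block_per_row(const int*   row_ptr,
                                       const int*   col_idx,
                                       const float* vals,
                                       const float* x,
                                       float*       y)
{
    extern __shared__ float partial[];   // one slot per thread

    const int row   = blockIdx.x;        // one block per row
    const int start = row_ptr[row];
    const int end   = row_ptr[row + 1];  // empty row => start == end

    // Element-wise multiply: each thread accumulates a strided slice.
    float sum = 0.0f;
    for (int j = start + threadIdx.x; j < end; j += blockDim.x)
        sum += vals[j] * x[col_idx[j]];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Shared-memory tree reduction (blockDim.x assumed a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        y[row] = partial[0];
}
```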

This method looks quite efficient.

Now I’d like to know whether using a dedicated CUDA library (e.g., cuSPARSE) would be more efficient.

I don’t know how cuSPARSE implements SpMV specifically.

How does that library (I mean cuSPARSE) allocate threads (or blocks) to improve the efficiency of SpMV?

From a programmer’s point of view, what is the more efficient way to do SpMV?

Hello.

cuSPARSE SpMV operates as a black box; all you can do is peek, with NVIDIA’s profiling tool (Nsight Compute), at the kernels it uses underneath. There you will see that it uses two kernels: the first partitions the work, and the second does the heavy computational work, creating a number of threads proportional to the number of nonzeros of the matrix.
More specifically, if I recall correctly, CUDA 12.5 assigns 5 nonzeros per CUDA thread, and an earlier version (CUDA 11, maybe) assigned 4 nonzeros per thread.

So if you would like an efficient way to execute SpMV on GPUs with your own custom kernel, you should follow the same guideline and create a number of threads proportional to the number of nonzeros of the matrix, as sketched below. I do not know whether the thread block size plays any significant role in all this.
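As a rough illustration of that guideline (and emphatically not cuSPARSE’s actual kernel), a “nonzero split” SpMV might look like the following; each thread owns a fixed chunk of nonzeros, locates its starting row with a binary search over row_ptr, and combines per-row partial sums with atomicAdd. All names and the chunk size are illustrative, and y must be zeroed beforehand:

```cpp
#define NNZ_PER_THREAD 5   // cf. the ~5 nonzeros/thread seen in CUDA 12.5

__device__ int find_row(const int* row_ptr, int num_rows, int k)
{
    // Largest row r with row_ptr[r] <= k (binary search).
    int lo = 0, hi = num_rows - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (row_ptr[mid] <= k) lo = mid;
        else                   hi = mid - 1;
    }
    return lo;
}

__global__ void spmv_nnz_split(const int* row_ptr, const int* col_idx,
                               const float* vals, const float* x,
                               float* y, int num_rows, int nnz)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * NNZ_PER_THREAD;
    if (first >= nnz) return;
    int last = min(first + NNZ_PER_THREAD, nnz);

    int   row = find_row(row_ptr, num_rows, first);
    float sum = 0.0f;
    for (int k = first; k < last; ++k) {
        while (k >= row_ptr[row + 1]) {  // crossed a row boundary:
            atomicAdd(&y[row], sum);     // flush the finished row
            sum = 0.0f;
            ++row;
        }
        sum += vals[k] * x[col_idx[k]];
    }
    atomicAdd(&y[row], sum);             // flush the last row touched
}
```

One caveat: the atomicAdd step makes the order in which partial sums land in y nondeterministic from run to run.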


Thanks for your quick answer; I had missed an important point. However, because of an accuracy issue, I need the order of summation to be deterministic (not completely, but fairly well ordered). I’ll have to give this some thought…

I have some additional questions.

  1. What function does cuSPARSE use to sum up the per-row results when implementing SpMV with a CSR matrix?
  2. When using the CSR format, can a row with no non-zero elements be skipped (by a thread)? I think some threads may be wasted…

For example: A.T @ x = y (the result vector), where A is a (k x m) matrix, x is a (k x 1) vector, and y is a (m x 1) vector.

cf.) When I used atomicAdd (for the summation), the computational accuracy decreased. At the time, the matrix’s data type was float32.
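For reference, as far as I can tell the transposed product above maps onto cuSPARSE’s generic SpMV API roughly like this (a sketch built on the documented cusparseCreateCsr/cusparseSpMV entry points; error checking omitted and variable names illustrative):

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: y = A^T * x, with A a k x m float32 CSR matrix, so x has length k
// and y has length m. All pointers are device pointers.
void spmv_transposed(cusparseHandle_t handle, int k, int m, int nnz,
                     int* d_row_ptr, int* d_col_idx, float* d_vals,
                     float* d_x, float* d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;

    cusparseCreateCsr(&matA, k, m, nnz, d_row_ptr, d_col_idx, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, k, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, m, d_y, CUDA_R_32F);

    // Query the workspace size, then run the multiplication.
    size_t bufSize = 0;
    void*  dBuffer = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY,
                            CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuffer, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY,
                 CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

    cudaFree(dBuffer);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroySpMat(matA);
}
```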

A typical sparsity level (i.e. percentage of non-zero values) for cuSPARSE methods to be interesting might be 1% or less.

I don’t think cuSPARSE will be interesting from a performance perspective at 63% non-zero values.

If it were my code, and I already had an implementation of “sparse” matrix-vector multiply, and the non-zero percentage was 63%, I would probably compare its performance against cublas<T>gemv with a dense realization of the matrix, as sketched below.
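For reference, the dense comparison is essentially one call; a minimal sketch for float32 with column-major storage (error checking omitted):

```cpp
#include <cublas_v2.h>

// Sketch: y = A * x for a dense, column-major m x k float32 matrix A.
// d_A, d_x, d_y are device pointers; handle comes from cublasCreate().
void dense_gemv(cublasHandle_t handle, int m, int k,
                const float* d_A, const float* d_x, float* d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS assumes column-major storage; the leading dimension is m.
    cublasSgemv(handle, CUBLAS_OP_N, m, k,
                &alpha, d_A, m, d_x, 1, &beta, d_y, 1);
}
```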


Hi. I work on cuSPARSE. I agree with everything that’s been said here. The best SpMV for your matrix is almost surely going to be a dense GEMV from cuBLAS.

I don’t want to get into the internals of how cuSPARSE SpMV works; the implementation is complicated. If you’re interested in the topic in general, “spmv load balancing” is a good thing to search for on Google Scholar.


Thanks! A few months ago I used CuPy, and it consumed more than 40 GB of memory for twelve 2D arrays and a few operations! So I turned to SpMV to address the memory issue and the computation speed.
Based on your comments, it looks like a good way to address both issues (memory and speed), given the differences between CuPy and CUDA (especially in C++). I don’t yet know how CuPy and CUDA manage things internally when implementing matrix-vector multiplication, but it’s clear that I should try one of these libraries!

Thanks for your useful comment. I should try cuBLAS! Compared with CuPy, it looks like CUDA manages memory more efficiently (though I don’t know specifically how it is controlled).
Anyway, I appreciate your quick comment.