CUSPARSE much slower than scipy.sparse?

Hi,

I am having trouble getting a sparse matrix multiplication to run fast using CUSPARSE on a Linux server. Initially, I was calling CUSPARSE via the accelerate.sparse Python module. For an X^T*X calculation with X of size (678451, 1098), the runtime with accelerate is 30 times that of scipy (11.66 s vs. 0.39 s), in contradiction with NVIDIA's reports that CUSPARSE is several times faster than MKL. Thinking that the problem was in the accelerate wrapper, I tried calling the C++ CUSPARSE cusparseDcsrgemm function directly, but got the same kind of performance. For a bigger matrix, CUSPARSE performed even worse relative to scipy. The number of non-zeros in the matrix is 5556733 (i.e., the matrix density is 0.0075). Sparse-matrix-times-dense-vector multiplication is also much slower with CUSPARSE than with scipy.
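For reference, the scipy side of my benchmark looks essentially like the sketch below. scipy.sparse.random stands in for my actual .npy data, and I use a smaller row count here so it runs quickly; the real matrix is 678451 x 1098 at density 0.0075.

```python
import numpy as np
import scipy.sparse as sp
from time import perf_counter

# Stand-in for the real data: a random CSR matrix with the same column
# count and density as described above, but fewer rows so the example
# runs quickly (the real matrix has 678451 rows).
rows, cols, density = 67845, 1098, 0.0075
X = sp.random(rows, cols, density=density, format="csr", dtype=np.float64)

t0 = perf_counter()
G = X.T @ X          # the X^T * X product being benchmarked
elapsed = perf_counter() - t0

print(G.shape, elapsed)
```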

What could be going on? The GPU is a Tesla K40m. No other processes were running on the server. The runtimes are reproducible; CUSPARSE is always much slower than scipy. The CUDA version I am using is 8.0. Monitoring GPU usage with nvidia-smi, I can see that the GPU is indeed being used. It doesn't seem to be a hardware issue.

One other data point: CUBLAS does perform faster than numpy. So, the problem I am having is with CUSPARSE, not CUBLAS.

lscpu output for the linux server is as follows:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 1200.000
BogoMIPS: 5186.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

I’d appreciate your feedback. Thanks a lot!

This description is very confusing to me. As best I can establish, your application initially invoked CUSPARSE via the accelerate.sparse Python module, but now your app is invoking CUSPARSE directly, and as a result performance has dropped.

If so: since CUSPARSE does the bulk of the work in either variant, the performance issue would not appear to be with CUSPARSE itself, but rather with how it is configured and which sequence of APIs is called. Could you instrument accelerate.sparse to find out how it maps work onto CUSPARSE?
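For example, one generic way to instrument a Python wrapper without touching its source is to monkey-patch its public callables with a timing shim that records which entry points get hit. A minimal sketch, using a stand-in module since I don't have accelerate.sparse's internals in front of me (the `csrgemm` placeholder below is hypothetical):

```python
import functools
import time
import types

def instrument(module, log):
    """Wrap every public callable in `module` so each call is logged
    as a (name, elapsed-seconds) tuple appended to `log`."""
    for name in dir(module):
        attr = getattr(module, name)
        if callable(attr) and not name.startswith("_"):
            @functools.wraps(attr)
            def wrapper(*args, __f=attr, __name=name, **kwargs):
                t0 = time.perf_counter()
                result = __f(*args, **kwargs)
                log.append((__name, time.perf_counter() - t0))
                return result
            setattr(module, name, wrapper)

# Demo on a stand-in module; in practice you would instrument the
# module inside accelerate.sparse that holds the CUSPARSE bindings.
fake = types.ModuleType("fake_cusparse")
fake.csrgemm = lambda a, b: "C"   # placeholder for the real binding
calls = []
instrument(fake, calls)
fake.csrgemm("A", "B")
print(calls)                      # each entry: (function name, seconds)
```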

It is also not clear what performance comparison data you are looking at. The speedup of CUSPARSE "relative to MKL" will likely depend on which operation is performed, the size and shape of the matrices involved, their sparsity and storage format, and the CPUs and GPU used to run the code.

My intention is to use accelerate.sparse, all Python code. I know that accelerate.sparse is a wrapper around CUSPARSE. Since I was getting bad performance, I tried calling CUSPARSE directly, just as a test, in order to rule out issues with accelerate.sparse. In the end I want to use only accelerate.sparse, not both.

Since I got the same performance whether using accelerate.sparse or calling CUSPARSE directly, I think the problem is with CUSPARSE itself, as accelerate.sparse is just a wrapper.

What I want is to see accelerate.sparse run faster than scipy.sparse. I believe scipy.sparse uses MKL, hence my reference to MKL. Here is an NVIDIA link that says CUSPARSE is many times faster than MKL: https://developer.nvidia.com/cusparse.

I am using CSR format (I tried CSC too). I have tried different sizes of matrices, all of them very sparse. scipy.sparse always ran much faster than CUSPARSE in my tests.
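To be concrete about the formats: on the scipy side, CSR and CSC give the same X^T*X result, e.g. (random data again standing in for mine):

```python
import numpy as np
import scipy.sparse as sp

# Random stand-in data at roughly the density described above.
X_csr = sp.random(10000, 1098, density=0.0075, format="csr", dtype=np.float64)
X_csc = X_csr.tocsc()   # same matrix, CSC storage

# Both storage formats yield the same X^T * X product.
G1 = (X_csr.T @ X_csr).toarray()
G2 = (X_csc.T @ X_csc).toarray()
print(np.allclose(G1, G2))  # prints True
```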

Seems I misunderstood the situation. The solution seems simple then: keep using scipy.sparse.

Thanks for your replies. All of this started because I wanted to improve the performance of my application. scipy.sparse is fast but I need my application to run even faster, as with much bigger data even scipy.sparse would not be fast enough. So I tried CUSPARSE but got the unexpected results I described, with small or big data.

Hi, does anybody from NVIDIA have any input on this? I have .npy (numpy array) files that can be used to demonstrate the issue. Thanks!

It seems that one of the claims you are making is that the NVIDIA performance data on CUSPARSE is inaccurate, e.g. referring to this link:

https://developer.nvidia.com/cusparse

What I see at that link is a claim that CUSPARSE is around 2x-5x faster than MKL for the stated configuration (K40m vs. a single 12-core Ivy Bridge CPU, MKL 11.0.4), for the SpMV operation, for certain example sparse matrices.

If you want to dispute that, and provide your test case for comparison of CUSPARSE to MKL, written in C or C++, I will take a look.

I’m not able to pursue or comment on scipy, numpy, numba, accelerate, or any products like that, as they are not NVIDIA products and we don’t control them.

I'll need a complete test case or I won't be able to pursue anything. Also, I work on things like this as time permits; other than filing a bug in what I consider to be a reproducible issue, I can't offer any definite remedies or guaranteed responses.

You could file an enhancement request with NVIDIA. The proper venue for this is the bug reporting form linked from the CUDA registered developer website. Just prefix the bug report’s synopsis with "RFE: ", to mark it as an enhancement request rather than a report for a functional bug.

You would want to attach the smallest possible self-contained code that reproduces the performance issues of concern while using the latest released version of CUDA.

I would suggest following the recommendation by njuffa for anything beyond the very narrow lily pad I traced out.