CUSPARSE much slower than scipy.sparse?

Hi,

I am having trouble getting a sparse matrix multiplication to run fast using CUSPARSE on a Linux server. Initially, I was calling CUSPARSE via the accelerate.sparse Python module. For an X^T*X calculation with X of size (678451, 1098), the runtime with accelerate is 30 times that of scipy (11.66 s vs. 0.39 s), in contradiction with NVIDIA's reports that CUSPARSE is several times faster than MKL. Thinking that the problem was in the accelerate wrapper, I tried calling the C++ CUSPARSE cusparseDcsrgemm function directly, but got the same kind of performance. For a bigger matrix, CUSPARSE performed even worse relative to scipy. The number of non-zeros in the matrix is 5556733 (i.e., the matrix density is 0.0075). Sparse-matrix-times-dense-vector multiplication is also much slower with CUSPARSE than with scipy.
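For reference, the scipy side of my benchmark looks essentially like the sketch below. scipy.sparse.random stands in for my actual .npy data, and I use a smaller row count here so it runs quickly; the real matrix is 678451 x 1098 at density 0.0075.

```python
import numpy as np
import scipy.sparse as sp
from time import perf_counter

# Stand-in for the real data: a random CSR matrix with the same column
# count and density as described above, but fewer rows so the example
# runs quickly (the real matrix has 678451 rows).
rows, cols, density = 67845, 1098, 0.0075
X = sp.random(rows, cols, density=density, format="csr", dtype=np.float64)

t0 = perf_counter()
G = X.T @ X          # the X^T * X product being benchmarked
elapsed = perf_counter() - t0

print(G.shape, elapsed)
```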

What could be going on? The GPU is a Tesla K40m. No other processes were running on the server. The runtimes are reproducible; CUSPARSE is always much slower than scipy. The CUDA version I am using is 8.0. Monitoring GPU usage with nvidia-smi, I can see that the GPU is indeed being used. It doesn't seem to be a hardware issue.

One other data point: CUBLAS does perform faster than numpy. So, the problem I am having is with CUSPARSE, not CUBLAS.

lscpu output for the linux server is as follows:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 1200.000
BogoMIPS: 5186.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

I’d appreciate your feedback. Thanks a lot!

This description is very confusing to me. As best I can establish, your application initially invoked CUSPARSE via the accelerate.sparse Python module, but now your app is invoking CUSPARSE directly, and as a result performance has dropped.

If so: since CUSPARSE does the bulk of the work in either variant, the performance issue would not appear to be with CUSPARSE itself, but rather with how it is configured and which sequence of APIs is called. Could you instrument accelerate.sparse to find out how it maps work onto CUSPARSE?
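For example, one generic way to instrument a Python wrapper without touching its source is to monkey-patch its public callables with a timing shim that records which entry points get hit. A minimal sketch, using a stand-in module since I don't have accelerate.sparse's internals in front of me (the `csrgemm` placeholder below is hypothetical):

```python
import functools
import time
import types

def instrument(module, log):
    """Wrap every public callable in `module` so each call is logged
    as a (name, elapsed-seconds) tuple appended to `log`."""
    for name in dir(module):
        attr = getattr(module, name)
        if callable(attr) and not name.startswith("_"):
            @functools.wraps(attr)
            def wrapper(*args, __f=attr, __name=name, **kwargs):
                t0 = time.perf_counter()
                result = __f(*args, **kwargs)
                log.append((__name, time.perf_counter() - t0))
                return result
            setattr(module, name, wrapper)

# Demo on a stand-in module; in practice you would instrument the
# module inside accelerate.sparse that holds the CUSPARSE bindings.
fake = types.ModuleType("fake_cusparse")
fake.csrgemm = lambda a, b: "C"   # placeholder for the real binding
calls = []
instrument(fake, calls)
fake.csrgemm("A", "B")
print(calls)                      # each entry: (function name, seconds)
```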

It is also not clear what performance comparison data you are looking at. The speedup of CUSPARSE "relative to MKL" will likely depend on which operation is performed, the size and shape of the matrices involved, their sparsity and storage format, and the CPUs and GPU used to run the code.

My intention is to use accelerate.sparse, all Python code. I know that accelerate.sparse is a wrapper around CUSPARSE. Since I was getting bad performance, I tried calling CUSPARSE directly, just as a test, in order to rule out issues with accelerate.sparse. In the end I want to use only accelerate.sparse, not both.

Since I got the same performance whether using accelerate.sparse or calling CUSPARSE directly, I think the problem is with CUSPARSE itself, as accelerate.sparse is just a wrapper.

What I want is to see accelerate.sparse run faster than scipy.sparse. I believe scipy.sparse uses MKL, hence my reference to MKL. Here is an NVIDIA link that says CUSPARSE is many times faster than MKL: https://developer.nvidia.com/cusparse.

I am using CSR format (I tried CSC too). I have tried different sizes of matrices, all of them very sparse. scipy.sparse always ran much faster than CUSPARSE in my tests.
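To be concrete about the formats: on the scipy side, CSR and CSC give the same X^T*X result, e.g. (random data again standing in for mine):

```python
import numpy as np
import scipy.sparse as sp

# Random stand-in data at roughly the density described above.
X_csr = sp.random(10000, 1098, density=0.0075, format="csr", dtype=np.float64)
X_csc = X_csr.tocsc()   # same matrix, CSC storage

# Both storage formats yield the same X^T * X product.
G1 = (X_csr.T @ X_csr).toarray()
G2 = (X_csc.T @ X_csc).toarray()
print(np.allclose(G1, G2))  # prints True
```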

Seems I misunderstood the situation. The solution seems simple then: keep using scipy.sparse.

Thanks for your replies. All of this started because I wanted to improve the performance of my application. scipy.sparse is fast but I need my application to run even faster, as with much bigger data even scipy.sparse would not be fast enough. So I tried CUSPARSE but got the unexpected results I described, with small or big data.

Hi, does anybody from NVIDIA have any input on this? I have .npy (numpy array) files that can be used to demonstrate the issue. Thanks!

It seems that one of the claims you are making is that the NVIDIA performance data on CUSPARSE is inaccurate, e.g. referring to this link:

https://developer.nvidia.com/cusparse

What I see at that link is a claim that CUSPARSE is around 2x-5x faster than MKL for the stated configuration (K40m vs. a single 12-core Ivy Bridge CPU, MKL 11.0.4), for the SpMV operation, for certain example sparse matrices.

If you want to dispute that, and provide your test case for comparison of CUSPARSE to MKL, written in C or C++, I will take a look.

I’m not able to pursue or comment on scipy, numpy, numba, accelerate, or any products like that, as they are not NVIDIA products and we don’t control them.

I'll need a complete test case or I won't be able to pursue anything. Also, I work on things like this as time permits; other than filing a bug in what I consider to be a reproducible issue, I can't offer any definite remedies or guaranteed responses.

You could file an enhancement request with NVIDIA. The proper venue for this is the bug reporting form linked from the CUDA registered developer website. Just prefix the bug report’s synopsis with "RFE: ", to mark it as an enhancement request rather than a report for a functional bug.

You would want to attach the smallest possible self-contained code that reproduces the performance issues of concern while using the latest released version of CUDA.

I would suggest following the recommendation by njuffa for anything beyond the very narrow lily pad I traced out.