Hi,

I am having trouble getting sparse matrix multiplication to run fast with CUSPARSE on a Linux server. Initially I called CUSPARSE via the accelerate.sparse Python module. For an X^T * X computation with X of shape (678451, 1098), the accelerate runtime is roughly 30 times that of scipy (11.66 s vs 0.39 s), which contradicts NVIDIA's reports that CUSPARSE is several times faster than MKL. Suspecting the accelerate wrapper, I then called the C++ CUSPARSE function cusparseDcsrgemm directly, but got the same kind of performance; for a bigger matrix, CUSPARSE performed even worse relative to scipy. The matrix has 5556733 non-zeros (i.e. a density of 0.0075). Sparse matrix times dense vector multiplication is also much slower with CUSPARSE than with scipy.
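For context, the scipy side of the comparison is essentially the following minimal sketch (a smaller random matrix stands in here for my actual data so it runs quickly; the real run used X of shape (678451, 1098) with the same density):

```python
import time
import numpy as np
import scipy.sparse as sp

# Stand-in for the real data: the actual X is (678451, 1098) with
# density 0.0075; a smaller matrix keeps this sketch fast.
rng = np.random.RandomState(0)
X = sp.random(10000, 1098, density=0.0075, format="csr", random_state=rng)

t0 = time.time()
G = (X.T @ X).tocsr()  # Gram matrix X^T * X; stays sparse in scipy
elapsed = time.time() - t0

print(G.shape, f"{elapsed:.3f}s")
```

The CUSPARSE timings were taken the same way, wrapping only the multiplication call.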

What could be going on? The GPU is a Tesla K40m, and no other processes were running on the server. The runtimes are reproducible: CUSPARSE is always much slower than scipy. The CUDA version I am using is 8.0. Monitoring GPU usage with nvidia-smi, I can see that the GPU is indeed being used, so this does not appear to be a hardware issue.

One other data point: CUBLAS does run faster than numpy, so the problem I am having is specific to CUSPARSE, not CUBLAS.

The lscpu output for the Linux server is as follows:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel® Xeon® CPU E5-2630 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 1200.000
BogoMIPS: 5186.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

I’d appreciate your feedback. Thanks a lot!