slow performance cusparse spmv

Hi all,

I’m trying to implement SpMV for a sparse matrix of doubles, and I’m getting really slow performance with CUDA in general.

I’ve tried the following implementations:

  • Naive CUDA kernel for the CSR format
  • Warp-per-row CUDA kernel for the CSR format
  • Naive OpenCL kernel for the CSR format
  • The cusparseDcsrmv routine
  • Converting from CSR to HYB (cusparseDcsr2hyb) and using cusparseDhybmv

In all cases the performance is worse than a single-threaded naive CPU implementation. I’m using pinned memory for CUDA, and double4 with loop unrolling for OpenCL.
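For context, the CPU baseline is just the textbook single-threaded CSR loop. A minimal sketch (array names here are illustrative, not my actual code):

```c
/* Textbook single-threaded CSR SpMV: y = A*x.
   row_ptr has n_rows+1 entries; col_idx and vals have nnz entries. */
static void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *vals, const double *x, double *y)
{
    for (int row = 0; row < n_rows; row++) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
            sum += vals[j] * x[col_idx[j]];
        y[row] = sum;
    }
}
```

The naive CUDA version parallelizes the outer loop with one thread per row; the warped version uses one warp per row.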

My matrices are:

  1. rows = 900,690, non-zeros = 34,803,656
  2. rows = 436,097, non-zeros = 35,241,024
  3. rows = 11,374,798, non-zeros = 202,029,991

These SpMV operations are part of a GMRES solver and are called many times.
My GPU is a Tesla K20m, I’m using driver version 319.37 and toolkit 5.5.22. The GPU utilization shown by nvidia-smi is at most 7%.

Is there something obvious that I’m not seeing? Or any condition that I should be aware of?
(I don’t know if I supplied enough info, if not please ask for something more)

Thanks in advance.

Just to confirm: once you have downloaded the data to the GPU, are you keeping it resident on the GPU for the entire duration of GMRES?

In general, SpMV is bandwidth-limited, so the performance should roughly reflect the relative memory throughput of the CPU and the GPU, which should differ by a factor greater than four in the GPU’s favor.
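The bandwidth-limited argument can be made concrete with a back-of-the-envelope estimate. CSR SpMV with double values and int column indices streams roughly 12 bytes per nonzero, plus the x and y vector traffic. A sketch, using illustrative bandwidth assumptions (~10 GB/s for a single CPU core, ~150 GB/s for a K20-class GPU):

```c
/* Rough memory-traffic estimate per CSR SpMV, in bytes. */
static double spmv_traffic_bytes(long long rows, long long nnz)
{
    /* 12 bytes per nonzero (8-byte double value + 4-byte int column index),
       plus reading x and writing y once each (8 bytes per row) --
       a lower bound that ignores irregular reuse of x */
    return 12.0 * (double)nnz + 16.0 * (double)rows;
}

/* Expected kernel time in milliseconds at a given sustained bandwidth. */
static double expected_ms(double bytes, double bandwidth_bytes_per_s)
{
    return 1e3 * bytes / bandwidth_bytes_per_s;
}
```

For matrix 1 above (900,690 rows, 34.8M nonzeros) this gives about 0.43 GB of traffic per SpMV: roughly 43 ms at an assumed 10 GB/s single-core CPU versus roughly 3 ms at ~150 GB/s on the GPU, i.e. the order-of-magnitude gap one should expect to see in the kernel itself.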

Yes, I am keeping the CSR matrix on the GPU for the entire duration; I’m only copying the input vector (x) to the GPU and the result vector back from the GPU at each iteration.

Some runtime examples (whole method):
matrix 1 - CPU: 2 s, GPU (cuSPARSE CSR): 3.3 s
matrix 2 - CPU: 18 s, GPU (same): 18 s
matrix 3 - CPU: 319 s, GPU (same): 320 s

Converting the matrix to HYB gives roughly the same runtimes as CSR.


Have you done any profiling of the app (both CPU side and GPU side)? Not knowing any details of your app, the fact that the CPU and GPU times above match within noise across multiple problem sizes suggests some bottleneck common to both cases, maybe I/O to read the original data.

I think I was misunderstood: by “whole method time” I meant the whole GMRES method (only the solve phase).

I’m not counting the time spent in the other steps.

Even so, the fact that CPU and GPU performance scale exactly the same, and are in fact the same within noise level, looks somewhat improbable. If this were my code, I would be digging in with a profiler to get an exact accounting of where the time is spent. My working hypothesis is that most of the time in the GPU version is not actually spent doing GPU work.

When I first implemented my CG methods in CUDA I did not immediately see a big speedup. Some of the slowdown comes from unexpected places: the dot product in cuBLAS, for example, is seriously under-optimized. The tridiagonal solve in cuSPARSE, while very quick, appears to be a synchronous call, which will hinder multi-GPU efforts. Your first step, however, should be to rule out operator error, i.e. verify that the residuals are identical (or at least very, very close) for the first 10-20 iterations. This is almost certainly operator error, which is a double-edged sword: you will have to find out what’s going wrong, but once you do, you will have a much faster solver :)

Re: The dot product in CUBLAS is seriously under-optimized

Does this statement refer to a particular GPU architecture, precision (S, D, C, Z), or configuration (e.g. stride or vector length), or to CUBLAS generally? What performance was observed versus the expected performance? Has a bug been filed?

Keep in mind that my info is about 6 months out of date. However…

When using double precision (D):

Dot-product runtime for a vector of 800,000 elements = 10 ms

I was running a C2090, then switched to a K20x, and the runtime INCREASED to 15 ms. My only guess is that it uses one thread block per dot product to allow multiple dot products to run concurrently; the peak runtime for such a calculation should be <<1 ms.
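The <<1 ms expectation follows directly from the bandwidth bound: a dot product of two length-800,000 double vectors reads 2 × 800,000 × 8 bytes = 12.8 MB, which at an assumed ~150 GB/s takes well under 100 µs. As a sketch:

```c
/* Bandwidth-bound lower estimate, in microseconds, for a
   double-precision dot product of two length-n vectors. */
static double ddot_min_usec(long long n, double bandwidth_bytes_per_s)
{
    double bytes = 2.0 * 8.0 * (double)n;   /* read both input vectors once */
    return 1e6 * bytes / bandwidth_bytes_per_s;
}
```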

Edit: I have not filed a bug report because I would of course have to re-verify that data first; it would be silly to file bug reports on things that may already be fixed.

Thanks for the data. I assume this is a DDOT call where both strides are 1? It is possible that the code had not yet been fully optimized for Kepler in CUDA 5.0, or that some architectural difference between Fermi and Kepler is tripping up this code.

I will look into it; the closest configuration I have at my disposal is a C2050 and a K20c (on the same Linux64 platform, so a controlled experiment). In general it is very helpful when bug reports are filed for both functional and performance issues observed in the field, and we appreciate the help. Despite extensive internal test suites there is always the chance of some combination of the input parameters not being covered.

Sorry, correct, stride == 1. Even then, the C2090 was running at less than 10% of peak. The threadfence reduction from the CUDA samples absolutely destroyed it in performance, last I checked.

Thanks. I will look into it and follow-up with the CUBLAS team as necessary.

I don’t see any performance issues with DDOT in either CUDA 5.0 or CUDA 5.5. I tried this on an older workstation running 64-bit RHEL, with a C2050 and a K20c. Both cards had ECC enabled. I measured the STREAM ‘copy’ bandwidth of these GPUs when working on vectors of type double (8 bytes / element):

113 GB/sec for C2050
138 GB/sec for M2090 // for illustration purposes only
150 GB/sec for K20c

I then measured DDOT, using x- and y- increments of 1, and 800,000 elements in each vector. I used the “legacy” version of DDOT which delivers the result back to the CPU and is therefore synchronous. I measured end-to-end time for the cublasDdot() call as seen by the host using a high-precision timer with submicrosecond resolution. I measured GPU times using the CUDA profiler. As expected there are three components to GPU execution times: intra-threadblock reduction, inter-threadblock reduction, device-to-host transfer. The measured times are as follows:

C2050, CUDA 5.0: GPU = 151.5 usec (145.5 + 4.0 + 2.0), end-to-end = 247 usec
C2050, CUDA 5.5: GPU = 152.5 usec (146.5 + 4.0 + 2.0), end-to-end = 246 usec

K20c, CUDA 5.0: GPU = 101.0 usec (93.0 + 5.0 + 3.0), end-to-end = 215 usec
K20c, CUDA 5.5: GPU = 101.0 usec (93.0 + 5.0 + 3.0), end-to-end = 209 usec

So based on the GPU execution times, the data throughput on the K20c for DDOT is about 127 GB/sec, which seems reasonable compared to the STREAM throughput (which is measured using vectors of 10 million elements and is therefore a bit more efficient).
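Working backwards from the numbers above: 800,000 elements in each of two double vectors is 12.8 MB of reads, and dividing by the ~101 µs kernel time recovers the quoted ~127 GB/sec:

```c
/* Achieved throughput, in GB/s, of a dot product over two
   length-n double vectors measured at the given kernel time. */
static double achieved_gb_per_s(long long n, double usec)
{
    double bytes = 2.0 * 8.0 * (double)n;   /* two double input vectors */
    return bytes / (usec * 1e-6) / 1e9;
}
```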

Hmm, don’t I look like an idiot :P

Let me try to reproduce my execution times on my side. Maybe I was doing something wrong :S. Sorry for wasting your time!

Hey njuffa,
Finally got around to testing my dot-product issue. It’s more sinister than I thought:
The dot product itself is not the issue; it performs well enough, roughly 1% of the total execution time of an iteration of my code. What DOES happen is that the rest of my code slows down when using cuBLAS’s dot product instead of my custom kernel, and more importantly, each execution of exactly the same code becomes far more variable.

I do have one question - are repeated initializations of cublas/cusparse handles not recommended?

Here’s the code I’m using. darray is a simple device-array class; my own generalized dot-product code is omitted.

darray gendot(const darray &vec1, const darray &vec2)
{
    int m = vec1.dims0;
    int n = vec1.dims1;

    darray out(1,n);
    if ((vec1.dims0==vec2.dims0)&&(vec1.dims1==vec2.dims1)) {
        cublasHandle_t handle;
        CUBLAS(cublasCreate(&handle));  // handle must be created before use
        CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));
        // one dot product per column, result written straight to device memory
        // (.data here stands for darray's raw device pointer member)
        for (int i=0; i<n; i++)
            CUBLAS(cublasSdot(handle, m, vec1.data+i*m, 1, vec2.data+i*m, 1, out.data+i));
        CUBLAS(cublasDestroy(handle));
    } else {
        MSG("Gendot error dims: %d,%d * %d,%d", vec1.dims0,vec1.dims1,vec2.dims0,vec2.dims1);
        throw std::runtime_error("gendot -> dimension mismatch!");
    }
    return out;
}

Time for iteration 1 : 0.796480 seconds, accuracy so far : 1.163859
Time for iteration 2 : 0.759783 seconds, accuracy so far : 0.497118
Time for iteration 3 : 1.110367 seconds, accuracy so far : 0.391392
Time for iteration 4 : 0.954960 seconds, accuracy so far : 0.154466
Time for iteration 5 : 0.815792 seconds, accuracy so far : 0.093269
Time for iteration 6 : 1.044942 seconds, accuracy so far : 0.084907
Time for iteration 7 : 0.891542 seconds, accuracy so far : 0.036288

Compared to:

Time for iteration 1 : 0.724610 seconds, accuracy so far : 1.163859
Time for iteration 2 : 0.697980 seconds, accuracy so far : 0.497118
Time for iteration 3 : 0.703218 seconds, accuracy so far : 0.391392
Time for iteration 4 : 0.662221 seconds, accuracy so far : 0.154466
Time for iteration 5 : 0.700503 seconds, accuracy so far : 0.093268
Time for iteration 6 : 0.656961 seconds, accuracy so far : 0.084907
Time for iteration 7 : 0.728399 seconds, accuracy so far : 0.036288

Of this, 15 ms per iteration is spent in sdot in version 1, compared to about 5 ms in version 2 (this includes a cudaDeviceSynchronize() before starting and stopping the timer). BOTH are negligible, which makes me wonder why the rest of the code gets wobbly with sdot.