cusolverDnCgesvd performance vs MKL

I am testing some of the new Cuda Dense capabilities in Cuda 7.0. I am finding the SVD to be extremely slow compared to MKL. For example, in the code snippet below I load up a 1856 by 1849 complex matrix and perform an SVD. MKL can do the SVD in 2.2 sec wall-clock time. It takes cusolverDnCgesvd a whopping 41.3 sec wall-clock time. The reduction appears to be correct in both cases. Are there special compiling flags needed to achieve good performance with the new Cuda Dense functions?

Platform : Ubuntu 14.04, GeForce GTX 690, Driver 346.46, MKL bundled with Intel 15.0, 12 CPU threads.

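#include <algorithm>    // min
#include <complex>
#include <cstdlib>      // malloc
#include <iostream>
#include <cuda_runtime.h>
#include <cusolverDn.h>
using namespace std;

// _readMatrix and gpuErrchk are the poster's own helpers; their definitions were not included in the post.
int main()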
{
  cusolverDnHandle_t cuHandle;
  cusolverDnCreate(&cuHandle);

  int m = 1856;
  int n = 1849;
  int ldA = m;
  int minMN = min(m,n);
  complex<float>* h_A = (complex<float> *)malloc(m * n * sizeof(complex<float>));
  _readMatrix(m, n, h_A, "./VFull.dat");

  cuComplex* d_A;
  gpuErrchk(cudaMalloc(&d_A, m * n * sizeof(cuComplex)));
  gpuErrchk(cudaMemcpy(d_A, h_A, m * n * sizeof(cuComplex), cudaMemcpyHostToDevice));

  int ldUsvd = ldA;
  cuComplex* d_Usvd;
  gpuErrchk(cudaMalloc(&d_Usvd, m * m * sizeof(cuComplex)));
  int ldVsvdH = n;
  cuComplex* d_VsvdH;
  gpuErrchk(cudaMalloc(&d_VsvdH, n * n * sizeof(cuComplex)));
  float* d_Ssvd;
  gpuErrchk(cudaMalloc(&d_Ssvd, minMN * sizeof(float)));

  cusolverStatus_t status;

  int workSize = 0;
  status = cusolverDnCgesvd_bufferSize(cuHandle, m, n, &workSize);
  if (status != CUSOLVER_STATUS_SUCCESS)
    cout << "cusolverDnCgesvd_bufferSize failed." << endl;

  int *devInfo;
  gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
  cuComplex* work;
  gpuErrchk(cudaMalloc(&work, workSize * sizeof(cuComplex)));
  float* rwork;
  // rwork holds real values; sizing it in cuComplex units over-allocates but is harmless.
  gpuErrchk(cudaMalloc(&rwork, workSize * sizeof(cuComplex)));

  status = cusolverDnCgesvd(cuHandle, 'A', 'A', m, n, d_A, ldA,
    d_Ssvd, d_Usvd, ldUsvd, d_VsvdH, ldVsvdH, work, workSize, rwork, devInfo);

  switch(status)
  {
    case CUSOLVER_STATUS_SUCCESS:
      cout << "success" << endl;
      break;
    case CUSOLVER_STATUS_NOT_INITIALIZED :
      cout << "Library cuSolver not initialized correctly" << endl;
      break;
    case CUSOLVER_STATUS_INVALID_VALUE:
      cout << "Invalid parameters passed" << endl;
      break;
    case CUSOLVER_STATUS_INTERNAL_ERROR:
      cout << "Internal operation failed" << endl;
      break;
    case CUSOLVER_STATUS_EXECUTION_FAILED:
      cout << "Execution failed" << endl;
      break;
  }

  cudaDeviceSynchronize();

  cusolverDnDestroy(cuHandle);

  return 0;
}

Can anybody comment on performance of Cuda Dense SVD and UNMQR in Cuda 7.0? My company has several applications we were planning to migrate over to the new Cuda Dense capabilities in Cuda 7.0 and all studies thus far show the performance of these two routines is far inferior to MKL. I want to rule out implementation error or compilation error before we give up. Please help?

If you’re simply looking for a shoot-from-the-hip answer, I don’t have it. Perhaps someone else does.

If you want to provide the following, I will take a closer look, and spend more of my time considering this:

  • a complete code that I can copy, paste, compile, and run, without having to add anything or change anything, that includes timing output, and includes the code that measures the test on MKL vs. the test on CUSOLVER. Please also include a demonstration of results validation (matching) between CUSOLVER output and MKL output.

  • provide the compile command line

  • provide the exact execution command line, plus any relevant environment parameters e.g. MKL_NUM_THREADS

  • provide your actual host CPU configuration (I assume you have 12 cores? Is that a single CPU or dual-socket? what are the CPU(s), exactly?)
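As an illustration of the environment details requested above, a benchmark could print its own threading setup at startup so the numbers travel with the timing output. This is only a sketch: `envOrUnset` and `threadingSummary` are hypothetical names, and it assumes `MKL_NUM_THREADS` and `OMP_NUM_THREADS` are the variables that matter for the run.

```cpp
#include <cstdlib>
#include <string>
#include <thread>

// Returns the value of an environment variable, or "(unset)" if it is not defined.
std::string envOrUnset(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// One-line summary of the host threading setup, suitable for a benchmark log.
// hardware_concurrency() may return 0 when the value cannot be determined.
std::string threadingSummary() {
    return "hardware threads: " + std::to_string(std::thread::hardware_concurrency()) +
           ", MKL_NUM_THREADS=" + envOrUnset("MKL_NUM_THREADS") +
           ", OMP_NUM_THREADS=" + envOrUnset("OMP_NUM_THREADS");
}
```

Printing `threadingSummary()` once before the timed section would record whether the CPU run was single-threaded or not.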

If all of this is too much trouble, please accept my apologies and disregard my post. Perhaps someone else will know the answer you are looking for.

Well, for now I was looking for someone to comment on the code snippet above. Other than the matrix read, is it implemented how it was intended (the numbers appear to be correct)? Also, what would be the best way to compile that code snippet on the platform stated? That shouldn’t take much time at all - the code is so simple - just a simple Makefile to optimally compile that simple code. If someone can answer those two questions, and if it requires any changes by me, then I will gladly make the changes and re-run the test. If the test still shows outrageous timing differences (40 sec vs 2 sec is pretty bad) then I will look into providing a detailed testing package per your instructions. However, I hope Cuda developers have unit tests that they have run showing speed improvements over what the community typically uses (MKL is a big player here for sure.) Perhaps I can replicate their tests and I would be more than happy to try. I have many GPU cards to choose from and I can build a box configuration if need be.

This issue isn’t about a close race, but more of no competition at all. I just want to make sure I am setting things up as intended by Cuda Dense. That is why I am stating wall-clock time. I can just look at the second hand on my computer clock and compare - if it were a tight race I would put code timing in there, but not when it is 40 sec vs 2 sec.
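For anyone who wants the wall-clock comparison in the program output rather than read off a clock, a minimal sketch in standard C++ follows. The helper name `wallSeconds` is hypothetical and not part of the code posted in this thread.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in seconds, measured with a monotonic clock.
template <typename F>
double wallSeconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

When the timed callable launches GPU work, it should end with `cudaDeviceSynchronize()` so that asynchronous work is included in the measurement. One-time setup such as `cusolverDnCreate` and the first CUDA call (which creates the context) is probably best kept outside the timed region, so that the number reflects the SVD itself.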

Which exact CPU configuration are you using?

That GTX 690 is a few years old and is a 2-in-1 GPU. The application is only using one of the two, which is not fully taking advantage of the computational capability.

Also the 690 is not a great GPU for compute, as this list shows:

http://www.videocardbenchmark.net/high_end_gpus.html

To make sure your GPU is working correctly, run the CUDA-Z utility. You should see output like this:

http://imgur.com/DlqMv0s

If I remember correctly the usual SVD implementation is inherently more serial than other linear algebra sub-routines such as sgemm(). Also I have never used that library, rather worked more with cuBLAS and MAGMA.

MAGMA is known to be the fastest library and it does have an SVD routine.

http://icl.cs.utk.edu/projectsfiles/magma/pubs/25-MAGMA_1.3_SC12.pdf

They directly compare to MKL and have about a 4X performance boost over MKL (at least for LU decomp, but other comparison charts are shown).

I have several boxes with 4xK10 in each. They are all in use right now, but I hope to get one by this weekend. I will run the problem on a K10, which is typically smoking fast on floats. I will post the K10 comparison then.

690 checks out ok though.

SVD is new to Cuda 7.0 and Cuda Dense. I initially started with Cula, but I thought Cuda just rolled Cula dense stuff into Cuda 7.0. Cula claims it is better than MKL. Maybe my wires are crossed here.

I have several open issues with MAGMA too. I tried it several years ago and found a boatload of stability problems - I posted some in the MAGMA forums. I have stayed clear of MAGMA ever since. Maybe it is more stable now. It is a thought.

I will follow up with more K10 data later.

From your original post I cannot discern how your timing is being done, could you please post your entire application, which we might compile?

Ok, I have run the test on a K10 using a random matrix. The results are similar - I even set OMP threads to 1 for the Intel example. Thus, the comparison is a single GPU from a K10 form factor (CUDA) versus a single thread (MKL). The complete applications I ran, including Makefiles, are attached below. I use icpc as the compiler for each, as can be seen in the Makefiles. As stated before, I just use simple wall-clock time, and I put that snippet of code in each application.

I hope this is a simple compilation problem. Please help.

— Timings —

MKL LAPACKE_cgesvd wall-clock time (1 OMP thread): 10.5 sec
cusolverDnCgesvd wall-clock time (1 K10 GPU): 68.5 sec
CUDA_SVD.tar.gz (4.46 KB)
MKL_SVD.tar.gz (1.17 KB)

Is this issue worthy of a developer’s attention? It is technically not a bug in the sense that I am getting wrong answers or crashing. As stated before, I ran clean, and the SVD appears to be accurate using the relative norm ||A - U S V^H|| / ||A||. However, I would think the objective here is to offer dense capabilities (SVD, QR, LU, etc…) that strongly compete with, and actually exceed, typical CPU paradigms (e.g., MKL). Cuda Dense appears to be falling far short of advertisements. My company had high expectations for these new capabilities and we were preparing to set aside large chunks of development effort. It is an important decision for us so any advice or help is greatly appreciated.
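The accuracy check mentioned above can be sketched on the host roughly as follows. This is an illustration only: `relativeResidual` is a hypothetical helper, not the code from the attached tarballs. It assumes column-major storage, the Frobenius norm, a nonzero A, and that the last argument already holds V^H, as gesvd-style routines return it.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

using cf = std::complex<float>;

// Relative residual ||A - U*diag(S)*V^H||_F / ||A||_F for column-major data.
// A is m x n (leading dimension m), U is m x m, VH is n x n and already holds
// V^H, and S holds min(m, n) singular values. Sums are accumulated in double.
double relativeResidual(int m, int n, const std::vector<cf>& A,
                        const std::vector<cf>& U, const std::vector<float>& S,
                        const std::vector<cf>& VH) {
    const int k = std::min(m, n);
    double errSq = 0.0, normSq = 0.0;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            // Entry (i, j) of U*diag(S)*V^H, using only the first k columns of U
            // and the first k rows of V^H.
            std::complex<double> recon(0.0, 0.0);
            for (int p = 0; p < k; ++p) {
                const std::complex<double> u(U[i + p * m]);
                const std::complex<double> v(VH[p + j * n]);
                recon += u * static_cast<double>(S[p]) * v;
            }
            const std::complex<double> a(A[i + j * m]);
            errSq += std::norm(a - recon);   // std::norm is the squared magnitude
            normSq += std::norm(a);
        }
    }
    return std::sqrt(errSq / normSq);
}
```

The triple loop costs on the order of m*n*min(m,n) operations, so for the 1856 by 1849 case it is a one-off sanity check rather than something to include in the timed region.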

I’m able to reproduce your timing results.

I’ve reviewed this with the development team that is responsible for this library at NVIDIA.

In short, if you’re happy with MKL, you should use MKL.

This particular operation, at this time, is not intended to compete performance-wise with MKL. It exists for situations where MKL may not be available, and also as an indicator for future library directions for developers who may be developing on alternate platforms. In addition, due to the overhead of data transfer, it may provide some utility for cases where the data transfer time/overhead is large compared to the SVD calculation time (obviously not the case here.)
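To put rough numbers on the transfer remark for this case: a 1856 by 1849 single-precision complex matrix is about 27 MB, which moves in roughly 5 ms at an assumed sustained host-to-device rate of 6 GB/s. The bandwidth figure is an assumption for illustration, not a measurement, and the function names below are hypothetical.

```cpp
#include <cstddef>

// Bytes occupied by an m x n single-precision complex matrix
// (two floats per entry, matching cuComplex / complex<float>).
constexpr std::size_t complexMatrixBytes(std::size_t m, std::size_t n) {
    return m * n * 2 * sizeof(float);
}

// Estimated one-way transfer time in seconds for a given sustained bandwidth.
// The bandwidth is an input assumption, not a measured value.
constexpr double transferSeconds(std::size_t bytes, double bytesPerSecond) {
    return static_cast<double>(bytes) / bytesPerSecond;
}
```

For example, `transferSeconds(complexMatrixBytes(1856, 1849), 6.0e9)` evaluates to about 0.0046 s, which is negligible next to the 2 to 40 second SVD times reported in this thread.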

Although it’s possible that the performance of this library call may improve in the future, there are no plans to make any investments along these lines in the next 9 months or so.

If you desire personalized communication or wish to confirm any of the above, please file a bug at developer.nvidia.com and reference NVIDIA internal bug number 1645592. I probably won’t be able to respond to further specific or detailed questions about this, and in general probably won’t have much else to say about it.

Apologies for any confusion this may have caused. Since you reference advertisements, if you are aware of advertisements that indicate that cusolver dense is faster than MKL across the board, or with respect to this specific function, I’d be interested in seeing those.

I will not file a bug. I am surprised this is the intent for the library though. The spirit of CUDA is to accelerate. If the Dense package isn’t doing that then I think fair warning should be given. I was fooled, and I think others would be fooled as well. If you go here …

… there is verbiage along the lines of the CUDA acceleration spirit suggesting cuSOLVER is an accelerator. I believe we have established it is not, in general. I would recommend verbiage along what you stated above be added to the advertisements of cuSOLVER so that others may know the current intent of the library.

BTW, CULA Dense does speed up SVD. I have re-run several examples using the CULA Dense free download and for a reasonable size matrix CULA is indeed an accelerator. I suppose a follow up question would be why does CULA Dense accelerate but CUDA Dense does not?

Nevertheless, I am grateful you took the time to answer my questions and clarify the package.

I understand the confusion here. We do provide a decent speedup for LU, QR and LDL^t factorizations, which is what we would like to say for SVD as well. Our purpose with cuSOLVER is to provide dense and sparse direct solvers as part of the CUDA toolkit for the first time; we have to start somewhere. Since CULA is no longer supported, we felt it was urgent to get some functionality into the hands of developers in CUDA 7.0.
Since CUDA runs on more than x86 host CPUs these days, cuSOLVER fills a need where there is no MKL.
That being said, we can do better with SVD, but it will have to wait for the next CUDA release, priorities and timelines being tight already.

Joe Eaton, manager of cuSOLVER library

What was the problem you had with MAGMA’s magma_cgesvd()?

http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__gesvd.html

http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__cgesvd__driver.html

There have been recent updates to both the MAGMA project and the documentation. I would imagine it is faster than cuSOLVER, but you should be able to find out.

It is good enough for the US Department of Energy evidently.

Joe,

Firstly, thanks for your response. However, our company unfortunately has the same problem as mh1’s. And I guess that is pretty general.

We expect a speedup from the GPU in dense/sparse matrix operations (SVD, QR, LU), and our company has invested in several servers with K40s installed for testing purposes. Currently, preliminary performance comparisons with MKL (using cuSolver) cast doubt on the decision to invest more.

BTW, do you have any idea when NVIDIA will launch the next-generation Tesla product? The K80 is not as good as the K40 in terms of single-GPU capacity.

Again, why is no one considering MAGMA? Have you compared MKL to MAGMA?

http://icl.cs.utk.edu/magma/

It just had a significant update 9 days ago and they use this daily on the Titan supercomputer in Tennessee.

We are moving to MAGMA.

BTW, do you have any suggestions about how to conveniently build and run a test? We are using Windows 7 and VS2010. Many thanks in advance!

Try this guide:

https://github.com/maxhutch/magma/blob/master/README-Windows

Thank you!

I have just made a simple test of a 1856 by 1849 single-precision SVD on a system with a Titan X, but it is still slow under CUDA 7.5:

Eigen 3.3 JacobiSVD:   ~145s
Eigen 3.3 BDCSVD:      ~1.7s
cuSOLVER 7.5:          ~23s
MAGMA 2.0 sgesvd:      ~21s
MAGMA 2.0 sgesdd:      ~1.0s

Results above are all single-threaded or single-GPU. I don’t have access to MKL unfortunately.

Is there any news whether SVD performance will be improved in CUDA 8?

The performance is expected to improve in CUDA 8 w.r.t. MKL LAPACKE. This should be discoverable if you have access to CUDA 8 EA. If not, CUDA 8 should go to a more publicly available RC status (“soon” - in the next 30-60 days as an estimate at this time).

It’s still not expected to be faster than MKL.

I can’t speak to the performance vs. Eigen, but presumably Eigen may use MKL under the hood.

In any event, it’s still not expected to be faster than fast CPU implementations.