cusolverDnCgesvd performance vs MKL

Having spent some time recently writing sparse linear algebra subroutines, I should note that some of these algorithms have quite a few inherent serial dependencies. The trick is to use combinatorial pre-processing routines to reorganize the input matrix into a form (and a series of ordered processing levels) that can be computed in parallel inside a larger outer CPU loop.
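To illustrate the idea (this is my own minimal sketch, not code from any of the libraries mentioned), here is one common form of that pre-processing, often called level scheduling: for a sparse lower-triangular solve, group the rows into levels such that every row in a level depends only on rows from earlier levels, so all rows within one level can be processed in parallel.

```python
# Level-scheduling sketch for a sparse lower-triangular matrix.
# Rows whose dependencies all lie in earlier levels form one level;
# every row inside a level can then be solved in parallel.
import numpy as np
from scipy.sparse import csr_matrix

def level_schedule(L):
    """Group rows of a lower-triangular CSR matrix into dependency levels."""
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        # Row i depends on every column j < i where L[i, j] is nonzero.
        deps = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = deps[deps < i]
        level[i] = 0 if deps.size == 0 else level[deps].max() + 1
    return [np.flatnonzero(level == k) for k in range(level.max() + 1)]

# Small example system: rows 0 and 2 are independent (level 0),
# row 1 needs row 0 (level 1), row 3 needs rows 1 and 2 (level 2).
L = csr_matrix(np.array([[1., 0., 0., 0.],
                         [2., 1., 0., 0.],
                         [0., 0., 1., 0.],
                         [0., 3., 1., 1.]]))
print(level_schedule(L))
```

The outer CPU loop then walks the levels in order, launching one parallel (e.g. GPU) kernel per level.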

It is possible to beat the CPU multi-threaded implementations by as much as 2-3 times for applications such as sparse LU or sparse Cholesky factorization, but you may have to 'roll your own' implementation. NVIDIA does a good job of providing some free functionality in their SDK, but they too have limited resources. I expect that as time goes on there will be incremental improvements to cuSPARSE, cuBLAS, cuSOLVER and MAGMA, all of which are free (unlike MKL, to my understanding; correct me if I am wrong).

There is some discussion of the (sub-optimal, when compared with MKL) performance of SVD on the GPU at

As indicated in the paper referenced in that discussion, the classical Golub-Reinsch algorithm seems to be tough to port efficiently to the GPU, due to its serial dependencies.

There is some research on parallel methods for SVD computation (just Google 'parallel SVD') - see , , , , and (the last is quite recent and looks interesting).

Sorry to intrude; I'm not very good with linear algebra. Is it possible to model a linear system that solves for the matrix coefficients while constraining their values (e.g., -1 <= a <= 1, b == 1, c >= 0)?
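One way to model this (a hedged sketch, assuming SciPy is acceptable; the data below is synthetic for illustration): treat it as a bounded least-squares problem. `scipy.optimize.lsq_linear` handles inequality bounds like -1 <= a <= 1 and c >= 0 directly, and an equality such as b == 1 can be eliminated beforehand by moving that column to the right-hand side.

```python
# Bounded least-squares sketch: solve A @ [a, b, c] ~= y subject to
# -1 <= a <= 1, b == 1, c >= 0. The matrix A and vector y here are
# made-up example data, not from the original discussion.
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))           # 20 observations, 3 coefficients
y = A @ np.array([0.5, 1.0, 2.0])          # synthetic right-hand side

# Eliminate the equality b == 1 by moving its column to the right-hand side,
# leaving a bounded problem in the remaining coefficients a and c.
y_reduced = y - A[:, 1]
A_reduced = A[:, [0, 2]]
res = lsq_linear(A_reduced, y_reduced, bounds=([-1.0, 0.0], [1.0, np.inf]))
a, c = res.x
print(a, c)
```

If the constraints are more general than simple bounds (arbitrary linear equalities/inequalities), a quadratic-programming formulation is the usual next step.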