Having recently spent some time writing sparse linear algebra subroutines, I think it is important to note that some of the algorithms have quite a few inherent serial dependencies. The trick is to use (combinatorial) pre-processing routines to reorganize the input matrix into a form, and a series of ordered processing levels, where the work within each level can be computed in parallel while a larger outer CPU loop steps through the levels in order.
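
As an illustration of the level idea, here is a minimal sketch for a sparse lower-triangular solve with the matrix in CSR form. It is plain Python rather than CUDA, the function names (`compute_levels`, `group_by_level`, `solve_lower`) are made up for this example, and it assumes every row has a nonzero diagonal entry. On a GPU, the inner loop over the rows of one level is the part that would become a single kernel launch.

```python
def compute_levels(indptr, indices):
    """Pre-processing pass: level of row i is 1 + the highest level
    among the earlier rows j < i that row i depends on."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:  # strictly lower entry: row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    return level


def group_by_level(level):
    """Bucket row indices by level; rows in one bucket are independent."""
    n_levels = max(level) + 1 if level else 0
    groups = [[] for _ in range(n_levels)]
    for i, l in enumerate(level):
        groups[l].append(i)
    return groups


def solve_lower(indptr, indices, data, b):
    """Solve L x = b by processing the levels in order."""
    x = [0.0] * len(b)
    for rows in group_by_level(compute_levels(indptr, indices)):
        # Outer loop: serial, one step per level (the "CPU loop").
        for i in rows:
            # Inner loop: rows of one level do not depend on each other,
            # so this is the part that could run in parallel.
            s, diag = b[i], None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    s -= data[k] * x[j]
            x[i] = s / diag  # assumes a nonzero diagonal entry exists
    return x


# 4x4 example:  [2 . . .]
#               [. 3 . .]
#               [1 . 4 .]
#               [. 1 1 5]
indptr = [0, 1, 2, 4, 7]
indices = [0, 1, 0, 2, 1, 2, 3]
data = [2, 3, 1, 4, 1, 1, 5]

print(group_by_level(compute_levels(indptr, indices)))  # [[0, 1], [2], [3]]
print(solve_lower(indptr, indices, data, [2, 3, 5, 7]))  # [1.0, 1.0, 1.0, 1.0]
```

Rows 0 and 1 have no dependencies and land in the first level, row 2 waits on row 0, and row 3 waits on rows 1 and 2, so the four rows collapse into three serial steps. The fewer and wider the levels, the more parallelism is available, which is why the reordering step matters so much.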

It is possible to beat multi-threaded CPU implementations by as much as 2-3x for applications such as sparse LU or sparse Cholesky factorization, but you may have to ‘roll your own’ implementation. NVIDIA does a good job of providing some free functionality in their SDK, but they too have limited resources. I expect that as time goes on there will be incremental improvements to cuSPARSE, cuBLAS, cuSOLVER and MAGMA, all of which are free (unlike MKL, to my understanding; correct me if I am wrong).