CUBLAS tbsv performance.

I am in the process of revisiting some older linear algebra code of mine with the aim of getting it in shape for CUDA 4.0. In the course of this I thought I would try building a GPU version of the LAPACK routine DPBTRS using cublasDtbsv directly - so basically a pair of cublasDtbsv calls for forward and backward substitution for each right-hand side. The results for an N=20000 matrix with a semibandwidth of 1900 and 4 rhs vectors:

GTX470 CUBLAS 3.2:   total time = 1.453503, dpbtrf = 0.974225, dpbtrs = 0.479278

GTX470 CUBLAS 4.0:   total time = 1.415825, dpbtrf = 0.920702, dpbtrs = 0.495123

ACML 4.4 (4 threads): total time = 5.026893, dpbtrf = 4.853224, dpbtrs = 0.173669

(all times in seconds, no host-device memory copies included in the GPU timing, averaged over 20 runs).

While my GPU band Cholesky factorization is about 5 times faster than the multithreaded ACML version, the cublasDtbsv performance is pretty terrible - something like 3 times slower than the CPU version. Looking at the profiler output tells the story - the two cublasDtbsv kernels, dtbsv_main_lo_nt and dtbsv_main_lo_tr, are using almost 40% of the total GPU time recorded by the profiler.
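For reference, the solve stage in the timings above boils down to a pair of cublasDtbsv calls per right-hand side, something like this (a stripped-down sketch against the legacy cuBLAS API, with no error checking and placeholder names):

```c
/* Sketch of the DPBTRS-equivalent solve stage: the lower-band Cholesky
   factor L from the band factorization is held on the device in LAPACK/BLAS
   band storage (d_AB, leading dimension ldab >= kd + 1), and each rhs column
   is solved with a forward sweep (L y = b) followed by a backward sweep
   (L^T x = y).  Names are placeholders, no error checking. */
#include <cublas.h>

void gpu_dpbtrs_lower(int n, int kd, int nrhs,
                      const double *d_AB, int ldab,
                      double *d_B, int ldb)
{
    for (int j = 0; j < nrhs; ++j) {
        double *x = d_B + (size_t)j * ldb;
        /* forward substitution:  L * y = b */
        cublasDtbsv('L', 'N', 'N', n, kd, d_AB, ldab, x, 1);
        /* backward substitution: L^T * x = y */
        cublasDtbsv('L', 'T', 'N', n, kd, d_AB, ldab, x, 1);
    }
}
```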

Anyone else using this routine? Any suggestions for alternatives?

We are aware that TBSV (as well as TRSV) does not perform well on the GPU. It is memory-bound, as all BLAS2 routines are, and on top of that triangular solvers are very difficult to parallelize. Currently TBSV performance is poor but in line with TRSV. I think there is some room for improvement, but not that much. Frankly, we did not put much optimization effort into the BLAS2 routines (except GEMV) because we have not received many requests for them. We implemented them mainly for completeness of the BLAS library.

In your example, you mentioned that you have 4 rhs. Does that mean you call it 4 times in a row with the same matrix?
Would you be interested in some “TBSM”, which would certainly be more GPU-friendly?

Thanks for the reply. I understand that BLAS 2 routines don’t leave a lot of scope for performance on the GPU, I just wasn’t expecting the band triangular solvers to be quite so slow. I was really wondering about the algorithm the code uses - the occupancy of the tbsv kernels is incredibly low given that the test case has a rather large semibandwidth and a lot of rows. I guess the biggest surprise was how slow the triangular solvers are compared to the factorization I am using, which contains some very naive code and doesn’t do anything fancy like cyclic look-ahead or hybridizing with the host CPU.

About the actual problem - yes, there are four rhs vectors with the same lhs matrix. The matrix is a factorized Jacobian, the rhs come from the stage value approximations of an implicit Runge-Kutta scheme, and the whole thing is wrapped in a Newton solver, so there can be a number of these calls per time step. I have just about finished a BLAS 3 based version using dtrsm and dgemm which I suspect might be faster, even for the “skinny” Nx4 rhs matrix my problems usually have.
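To give an idea of the shape of it, the forward sweep goes roughly along these lines (a simplified sketch rather than the actual code: no error checking, placeholder names, and the coupling blocks are applied with dtrmm here instead of dgemm, since in band storage they sit as upper-triangular blocks that can be addressed directly with leading dimension ldab-1; the dgemm variant first unpacks them into a dense scratch buffer):

```c
/* Blocked forward sweep for a lower-band Cholesky factor in LAPACK band
   storage.  Block size = kd, so the factor is block bidiagonal: dense
   lower-triangular diagonal blocks (dtrsm) and upper-triangular coupling
   blocks (dtrmm).  The backward sweep is analogous with transposed blocks. */
#include <cublas.h>
#include <cuda_runtime.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* d_AB: lower-band factor, band storage, ldab >= kd + 1
   d_B:  n x nrhs right-hand sides, column major, leading dimension ldb
   d_W:  kd x nrhs scratch buffer on the device */
void blocked_forward_sweep(int n, int kd, int nrhs,
                           const double *d_AB, int ldab,
                           double *d_B, int ldb, double *d_W)
{
    for (int j0 = 0; j0 < n; j0 += kd) {
        int jb = imin(kd, n - j0);

        /* Solve the dense lower-triangular diagonal block in place.  The
           block starting at A(j0,j0) is addressed directly in band storage
           using leading dimension ldab - 1. */
        cublasDtrsm('L', 'L', 'N', 'N', jb, nrhs, 1.0,
                    d_AB + (size_t)j0 * ldab, ldab - 1,
                    d_B + j0, ldb);

        /* Number of rows the band reaches below the diagonal block
           (zero only for the final, possibly partial, block). */
        int m = imin(kd, n - j0 - jb);
        if (m > 0) {
            /* W := copy of the block of the solution just computed
               (jb == kd whenever m > 0). */
            cudaMemcpy2D(d_W, (size_t)kd * sizeof(double),
                         d_B + j0, (size_t)ldb * sizeof(double),
                         (size_t)kd * sizeof(double), nrhs,
                         cudaMemcpyDeviceToDevice);

            /* W := L21 * W, where L21 is the kd x kd upper-triangular
               coupling block starting at A(j0+kd, j0), again addressed
               directly in band storage; rows of W beyond m are computed
               but ignored. */
            cublasDtrmm('L', 'U', 'N', 'N', kd, nrhs, 1.0,
                        d_AB + (size_t)j0 * ldab + kd, ldab - 1,
                        d_W, kd);

            /* B(j0+kd : j0+kd+m-1, :) -= first m rows of W */
            for (int c = 0; c < nrhs; ++c)
                cublasDaxpy(m, -1.0, d_W + (size_t)c * kd, 1,
                            d_B + j0 + kd + (size_t)c * ldb, 1);
        }
    }
}
```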