I am in the process of revisiting some older linear algebra code of mine with the aim of getting it in shape for CUDA 4.0. In the course of this I thought I would try building a GPU version of the LAPACK routine DPBTRS using cublasDtbsv directly - so basically a pair of cublasDtbsv calls, one forward and one backward substitution, for each right-hand side. The results for an N=20000 matrix with a semibandwidth of 1900 and 4 rhs vectors:
GTX470, CUBLAS 3.2:   total time = 1.453503, dpbtrf = 0.974225, dpbtrs = 0.479278
GTX470, CUBLAS 4.0:   total time = 1.415825, dpbtrf = 0.920702, dpbtrs = 0.495123
ACML 4.4 (4 threads): total time = 5.026893, dpbtrf = 4.853224, dpbtrs = 0.173669
(all times in seconds, no host-device memory copies included in the GPU timing, averaged over 20 runs).
While my GPU band Cholesky factorization is about 5 times faster than the multithreaded ACML version, the cublasDtbsv performance is pretty terrible - something like 3 times slower than the CPU version. Looking at the profiler output tells the story - the two cublasDtbsv kernels, dtbsv_main_lo_nt and dtbsv_main_lo_tr, are using almost 40% of the total GPU time recorded by the profiler.
Anyone else using this routine? Any suggestions for alternatives?