Odd timing result from using cuBLAS Strsm()

Quite often I have to calculate the inverse of a triangular matrix (in ‘L’ form, i.e. lower triangular), and recently I have been using cuBLAS Strsm() to solve for it.
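To make the operation concrete: inverting L with a triangular solve amounts to solving L * X = I, so X ends up holding the inverse. With cuBLAS that would presumably mean calling cublasStrsm() with CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, alpha = 1, and B initialized to the identity (B is overwritten with the result). Below is a minimal host-side C++ sketch of the same computation, useful as a reference for checking GPU results. It is not the cuBLAS implementation; the function name invert_lower_triangular and the column-major layout with leading dimension n are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of cuBLAS or the linked project).
// Inverts a lower-triangular n x n matrix L stored column-major with leading
// dimension n by solving L * X = I with forward substitution, one column of
// the identity at a time.
std::vector<float> invert_lower_triangular(const std::vector<float>& L,
                                           std::size_t n) {
    assert(L.size() == n * n);
    std::vector<float> X(n * n, 0.0f);
    for (std::size_t col = 0; col < n; ++col) {
        // Solve L * x = e_col. Entries of x above row `col` stay zero,
        // so the substitution can start at the diagonal.
        for (std::size_t row = col; row < n; ++row) {
            float sum = (row == col) ? 1.0f : 0.0f;
            for (std::size_t k = col; k < row; ++k) {
                sum -= L[row + k * n] * X[k + col * n];
            }
            assert(L[row + row * n] != 0.0f);  // L must be non-singular
            X[row + col * n] = sum / L[row + row * n];
        }
    }
    return X;
}
```

The work here is roughly n^3/3 multiply-adds, and each column of X can be solved independently, but the rows within a column have to be processed in order.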

This generates the correct result in good time, but oddly, when I do the ‘not recommended’ thing and invert the entire matrix directly in dense form using Gauss-Seidel, I get the same result in about half the time.

If you look at line #897 of this file in my GitHub GroupLasso project, you can see the CUDA version of Gauss-Seidel, which runs faster (at least for smaller inputs like 512x512) than cuBLAS Strsm():

[url]https://github.com/OlegKonings/CUDA_ADMM_Group_Lasso/blob/master/CUDA_Group_Lasso/CUDA_Group_Lasso/admm_main.cu[/url]

Again, both return the same results, and for other linear algebra operations cuBLAS is much faster than my custom kernels, so I am a bit mystified by this difference.

I would prefer to use Strsm() because it reduces the code by about 200 lines, but the direct version is much faster in my test cases so far.

Any ideas as to why this is the case?