I do not know how you are implementing this, but for a 784x784 lower triangular matrix I get under 1 ms (the timer reports 0 ms) to solve AX=B using cuBLAS cublasStrsm(), which is faster than MATLAB.

It returns the correct answer.
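For reference, a minimal sketch of the kind of call involved (this is an illustration, not the poster's actual code; the size, alpha, and column-major layout are assumptions):

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Solve A * X = B for X, where A is n x n lower triangular and B is
// n x n, both already resident on the device in column-major order.
// X overwrites B, per the cublasStrsm convention.
void solve_lower_triangular(float *d_A, float *d_B, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    cublasStrsm(handle,
                CUBLAS_SIDE_LEFT,        // A multiplies X from the left
                CUBLAS_FILL_MODE_LOWER,  // A is lower triangular
                CUBLAS_OP_N,             // no transpose
                CUBLAS_DIAG_NON_UNIT,    // diagonal is not assumed to be 1
                n, n,                    // rows and columns of B
                &alpha,
                d_A, n,                  // lda
                d_B, n);                 // ldb; X lands in d_B

    cublasDestroy(handle);
}
```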

Keep in mind that MATLAB is 10-100 times faster than a naive algorithm written in a common programming language. MATLAB is highly optimized and has had teams of PhDs making it as fast as possible for over 13 years. GPU computing has not been around even half that time, yet it still whoops MATLAB on larger data sets.

Also, are the matrices dense or sparse? GPUs generally perform better on large dense matrices.

MATLAB may also look for sparsity patterns in dense matrices so it can use its faster sparse solvers, so you need to test genuinely dense matrices in both cuBLAS and MATLAB. Also, some characteristics of the matrix may be cached if you run the MATLAB routine more than once, which would give it an advantage.

You are comparing apples to oranges, and the test is not set up correctly.

GPUs scale very well, but for such a small matrix the difference will be less significant than for a large matrix (n > 5000).

What GPU is being used and what CPU?

Is MATLAB taking advantage of the multithreaded capabilities of the CPU?

What does the command line for nvcc look like?

Are you running in debug mode with the -G or -g flags?

Are you using the correct compute capability for your GPU (1.0, 2.0, 3.0, 3.5, etc.)?
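For a benchmark, the build should look something like this (the file names and the sm_35 target are assumptions; pick the -arch matching your own GPU, and never pass -G, which disables device-code optimization and badly skews timings):

```shell
# Release build targeting compute capability 3.5 (Kepler-class GPU).
nvcc -O3 -arch=sm_35 -lcublas -o solve solve.cu
```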

Are you counting memory copy time?

Are you waking the GPU from its idle state before the test, or is that time included in the measurement? (Newer GPUs idle to save power and need a small amount of time to spin back up.)
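A sketch of fair timing that addresses the last two points (assumptions: solve() wraps whatever device routine is under test, and the data is already on the GPU, so host-device copies stay outside the measurement):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

void solve(); // the device routine under test (assumed defined elsewhere)

void time_solve()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up run: wakes the GPU from its power-saving state and pays
    // one-time initialization costs; it is excluded from the measurement.
    solve();
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    solve();                     // the timed run
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the work actually finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("solve took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```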

Even for a dense 4096x4096 matrix, cublasStrsm() takes under 1 ms (the timer reports 0 ms) on my machine, and I know MATLAB will not come close to that time for a 4096x4096 dense matrix.

This also returned the correct answer.

So if cuBLAS can solve AX=B on a Kepler GPU in under 1 ms, you really cannot do much better than that.