Here is what I measured on a K20c (ECC enabled) for double-precision systems of size 6x6:
batch = 1000 dim = 6 time = 0.000373 sec
batch = 10000 dim = 6 time = 0.000872 sec
batch = 100000 dim = 6 time = 0.006235 sec
batch = 1000000 dim = 6 time = 0.058387 sec
batch = 10000000 dim = 6 time = 0.581739 sec
A few notes:
(1) These are solves of N distinct systems, each with a single RHS, using partial pivoting. If you have systems with multiple RHSs, you would probably want to look at the batched LU decomposition and batched TRSM in CUBLAS to better amortize the cost of the matrix transformation.
(2) A batch size of 1000 is definitely too small to fully utilize a K20. You would want to make your batches as large as possible, with a minimum batch size of around 10000 for best performance.
(3) The solver implementation in the downloadable package is based on shared memory storage, which is not optimal for very small systems (such as 6x6), especially on sm_35, which has a large register file. For the matrix inversion (also in the package) I therefore also implemented a one-matrix-per-thread approach that keeps all data in registers and is significantly faster for small systems, provided the batch size is sufficiently large (several thousand, as I recall). Looks like it is time for me to take a look at adding the same approach to the solver code :-)