Batched solver code available

What kind of performance have you guys seen with this batched solver for 6x6 or smaller systems?

I have several million system solves of size 6x6 that I’m trying to speed up, in double precision. Currently I have a batched LU partial-pivoting routine with bsolv that will do 20 million solves in 1-2 s, depending on batch size. I’m using a K20. I’m interested in batch sizes from 1000 to 20000.

Here is what I measured on a K20c (ECC enabled) for double-precision systems of size 6x6:

batch =     1000  dim = 6  time = 0.000373 sec
batch =    10000  dim = 6  time = 0.000872 sec
batch =   100000  dim = 6  time = 0.006235 sec
batch =  1000000  dim = 6  time = 0.058387 sec
batch = 10000000  dim = 6  time = 0.581739 sec

A few notes:

(1) These are solves of N distinct systems, each with a single RHS, using partial pivoting. If you have systems with multiple RHSs, you would probably want to look at the batched LU decomposition and batched TRSM in CUBLAS, which better amortize the cost of the factorization across the right-hand sides.

(2) Batch sizes of 1000 are definitely too small to fully utilize a K20. You would want to make your batches as large as possible, with a minimum batch size of around 10000 for best performance.

(3) The solver implementation in the downloadable package stores its working data in shared memory, which is not optimal for very small systems (such as 6x6), especially on sm_35 with its large register file. For the matrix inversion (also in the package) I therefore also implemented a one-matrix-per-thread approach that keeps all data in registers; it is significantly faster for small systems, provided the batch size is sufficiently large (several thousand, as I recall). Looks like it is time for me to add the same approach to the solver code :-)

Taking a closer look, the execution times reported above are somewhat inflated for the smaller batch sizes. I measured them as end-to-end execution times observed from the host, which includes synchronization overhead as well as the overhead of the heuristics in the host-side code; the fact that I am running this on a fairly old system is likely an additional contributing factor. The execution-time data in the previous post corresponds to the following measurement framework, where wallclock() is a high-precision system timer:

CUDA_SAFE_CALL (cudaThreadSynchronize()); /* ensure all prior GPU work is done */
time1 = wallclock();
status = dsolve_batch (A_d, b_d, b_d, N, batch);
CUDA_SAFE_CALL (cudaThreadSynchronize()); /* wait for the solver to complete */
time2 = wallclock();

Using the CUDA profiler to extract the pure kernel execution time gives the following results:

batch =     1000  dim = 6  time = 0.000067 sec  [GPU kernel only]
batch =    10000  dim = 6  time = 0.000593 sec  [GPU kernel only]
batch =   100000  dim = 6  time = 0.005904 sec  [GPU kernel only]
batch =  1000000  dim = 6  time = 0.058278 sec  [GPU kernel only]
batch = 10000000  dim = 6  time = 0.581364 sec  [GPU kernel only]

This shows the expected, essentially linear scaling of execution time with batch size.

I use PGI Fortran, and I implemented the solver I wanted by creating an interface for a C subroutine that calls the batched solver code. Out of curiosity, does anyone have a simpler, more robust way of binding the C interface of the batched solver code to Fortran?

A simpler way is to declare a Fortran interface using the iso_c_binding module.
Appendix C of the book “CUDA Fortran for Scientists and Engineers” has a full description, and there are many other examples in the book.
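As a sketch only: the argument names and types below are assumptions inferred from the dsolve_batch call in the timing snippet earlier in the thread, not the package's documented interface, so check them against the actual C prototype. The idea is that an interface block with bind(c) lets Fortran call the C entry point directly, with no hand-written C wrapper:

```fortran
interface
   function dsolve_batch(A, b, x, n, batch) result(status) &
            bind(c, name='dsolve_batch')
      use iso_c_binding, only: c_int, c_ptr
      type(c_ptr), value    :: A, b, x   ! device pointers (passed by value)
      integer(c_int), value :: n, batch  ! system size and batch count
      integer(c_int)        :: status
   end function dsolve_batch
end interface
```

The value attributes matter: C expects the pointers and integers by value, whereas Fortran's default is pass-by-reference.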