Batched solver code available

The source code for an efficient solver and a matrix inverse for small matrices, using partial pivoting, is now available to all registered developers.

On sm_2x GPUs (48KB shared memory), the maximum dimensions for the matrices are:
    DP      solver, non-batched matrix inverse:  2…76
    DP complex  solver, non-batched matrix inverse:  2…53

    DP      batched matrix inverse:        2…77
    DP complex  batched matrix inverse:        2…55

The code has been released under a BSD license.

It is available from the CUDA registered developer web site:

Thank you for making this code available.

Is there some simple way to make this code run using single precision complex numbers?

I’m currently working on a batched solver myself for small (dim 2-32) positive definite systems using Cholesky decomposition.
Are there any plans to extend this code to support this kind of system?

The code uses templates, so it should be very easy to change it to single precision complex numbers.
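
As a rough illustration of why a templated implementation makes the change of precision mostly a matter of the element type, here is a minimal CPU-side C++ sketch. It is not the released code: invert_small is a hypothetical helper that does Gauss-Jordan inversion with partial pivoting, and the same template body serves double and std::complex<float>.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the released code): inverts an n x n
// row-major matrix in place by Gauss-Jordan elimination with partial
// pivoting. Returns false if a pivot is exactly zero.
template <typename T>
bool invert_small(std::vector<T>& a, std::size_t n) {
    assert(a.size() == n * n);
    std::vector<T> inv(n * n, T(0));
    for (std::size_t i = 0; i < n; ++i) inv[i * n + i] = T(1);

    for (std::size_t col = 0; col < n; ++col) {
        // Partial pivoting: pick the row with the largest magnitude in this column.
        std::size_t piv = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::abs(a[r * n + col]) > std::abs(a[piv * n + col])) piv = r;
        if (std::abs(a[piv * n + col]) == 0) return false;
        if (piv != col) {
            for (std::size_t c = 0; c < n; ++c) {
                std::swap(a[piv * n + c], a[col * n + c]);
                std::swap(inv[piv * n + c], inv[col * n + c]);
            }
        }
        // Scale the pivot row so the pivot becomes 1.
        const T d = T(1) / a[col * n + col];
        for (std::size_t c = 0; c < n; ++c) {
            a[col * n + c] *= d;
            inv[col * n + c] *= d;
        }
        // Eliminate this column from every other row.
        for (std::size_t r = 0; r < n; ++r) {
            if (r == col) continue;
            const T f = a[r * n + col];
            for (std::size_t c = 0; c < n; ++c) {
                a[r * n + c] -= f * a[col * n + c];
                inv[r * n + c] -= f * inv[col * n + c];
            }
        }
    }
    a = inv;
    return true;
}
```

Calling invert_small on a std::vector<std::complex<float>> holding the 2 x 2 matrix diag(i, 2) should leave diag(-i, 0.5) in its place; calling it on a std::vector<double> needs no change to the function.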

Thanks, but I was thinking more about all the config optimization.
Complex float would have the same config as double, so I can just copy-paste that one.
Will the same config be optimal for non-complex float as well? With float you could fit more into shared memory, and therefore support larger matrices, so some changes to the launch config would be needed.
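
To put rough numbers on the shared memory argument, the sketch below computes the largest n for which a single n x n matrix fits in a given amount of shared memory. The function name max_square_dim is made up for illustration, and the result is only a ceiling: it ignores padding and any other per-block storage, and the limits quoted for the released code sit at or below it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: largest n such that n * n elements of elem_bytes
// each fit in smem_bytes. Raw upper bound only; a real kernel may need
// extra shared memory for padding or pivot bookkeeping.
std::size_t max_square_dim(std::size_t smem_bytes, std::size_t elem_bytes) {
    assert(elem_bytes > 0);
    std::size_t n = static_cast<std::size_t>(
        std::sqrt(static_cast<double>(smem_bytes / elem_bytes)));
    // Correct for any floating-point rounding in sqrt.
    while ((n + 1) * (n + 1) * elem_bytes <= smem_bytes) ++n;
    while (n > 0 && n * n * elem_bytes > smem_bytes) --n;
    return n;
}
```

With 48 KB this gives 110 for float (4 bytes), 78 for double and for complex float (8 bytes each), and 55 for complex double (16 bytes). Complex float and double share an element size, which is why they could share a configuration, while plain float leaves room for larger matrices.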

Nice, that’s very useful stuff for many applications.


I’m a Fortran programmer. I call the batched solver through a Fortran interface, and I think it works perfectly. But I wonder if you could provide a batched solver that can solve least squares problems. That would be helpful to me. I would also like you to add single precision support; that would be perfect!

I can’t view the code. I’m already registered on the Nvidia Developer Zone, but I can’t sign in at the link you provided. Where can I do the registration?

Here is the process I followed when I signed up recently. It sounds like you already completed the first step.

  1. Get a DevZone ID
    a. At the top right, either create a new account or, if you already have a DevZone account, just log in.

  2. Once logged in, go to “My Account” at the top right
    a. Complete the Basic Profile, and then complete the CUDA registered developer form.

  3. You will receive email confirmations and will usually be approved within one business day

  4. Once approved, the CUDA registered developer program home page is accessible via “My Account”; the program name will be green and is a hyperlink.

Hi, I have registered as a Basic Registered Developer, I have registered for this form, and I have applied for the CUDA/GPU Computing Registered Developer Program. When I click on the link given above, I get to a login screen, but it does not recognise my login or password. After trying several times, it appears I am being blocked: I cannot see the login page, and I am now getting “The connection has timed out. The server at is taking too long to respond.”

There are two registered developer websites, the old site and the new site. As best I am aware, they do not share login information. So the problem may simply be a mismatch between login information and website.

To access the code via the new registered developer website:

(1) Go to
(2) On right hand side, click on green link “Registered Developers Website”
(3) Log in or create new account as needed
(4) Click on green link “CUDA/GPU Computing Registered Developer Program”
(5) Scroll down to section “CUDA Batch Solver”
(6) Click on green link “follow this link”
(7) Click green “I ACCEPT” at usage agreement
(8) Your download should start

To access the code via the old registered developer website:

(1) Go to
(2) Sign in with email address and password
(3) There is a menu on the right hand side titled “Newest Downloads”
(4) Click on link “Batched Solver”
(5) Click on link “download”
(6) Click “Accept” button below the usage agreement
(7) Your download should start

Has anyone succeeded in converting this to single precision (float)?

The main problem seems to be the config class, which I’m not sure how to define for float.


If someone has a float/complex version, I’m also very interested!

Do you need the float/complex solver, or the float/complex matrix inverse? By the way, there are no dark secrets to finding the data needed for the config class, but it takes time to run the necessary experiments to find the best configuration.

There is a version of Matrix Inversion which I use for convex optimization problems. Have tested it on matrices as large as (4,000 by 4,000) and it works well on dense matrices.

Also libraries like CULA or MAGMA allow you to solve for it.

Not sure about a complex version though.

For batches of small matrices, CUBLAS offers getrfBatched to compute the LU decomposition and trsmBatched to solve triangular systems.
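
The two-step structure described here (factor each matrix, then solve triangular systems) can be sketched on the CPU as follows. This is a plain C++ reference with hypothetical function names (lu_factor, lu_solve, solve_batched); it is not the cuBLAS API and says nothing about how cuBLAS implements these routines.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU reference: factor one n x n row-major matrix in place
// as P*A = L*U with partial pivoting. L (unit diagonal) is stored below
// the diagonal, U on and above it. piv[k] is the row swapped with row k.
bool lu_factor(double* a, int* piv, int n) {
    for (int k = 0; k < n; ++k) {
        int p = k;
        for (int r = k + 1; r < n; ++r)
            if (std::fabs(a[r * n + k]) > std::fabs(a[p * n + k])) p = r;
        piv[k] = p;
        if (a[p * n + k] == 0.0) return false;  // singular to working precision
        if (p != k)
            for (int c = 0; c < n; ++c) std::swap(a[p * n + c], a[k * n + c]);
        for (int r = k + 1; r < n; ++r) {
            a[r * n + k] /= a[k * n + k];  // multiplier, stored in L
            for (int c = k + 1; c < n; ++c)
                a[r * n + c] -= a[r * n + k] * a[k * n + c];
        }
    }
    return true;
}

// Solve A x = b in place using the factors: apply the row swaps to b,
// then forward substitution with L and back substitution with U.
void lu_solve(const double* lu, const int* piv, int n, double* b) {
    for (int k = 0; k < n; ++k) std::swap(b[k], b[piv[k]]);
    for (int r = 1; r < n; ++r)
        for (int c = 0; c < r; ++c) b[r] -= lu[r * n + c] * b[c];
    for (int r = n - 1; r >= 0; --r) {
        for (int c = r + 1; c < n; ++c) b[r] -= lu[r * n + c] * b[c];
        b[r] /= lu[r * n + r];
    }
}

// "Batched" here just means looping over independent small systems that
// are stored back to back; on a GPU each system would be handled in parallel.
bool solve_batched(std::vector<double>& mats, std::vector<double>& rhs,
                   int n, int batch) {
    assert(mats.size() == static_cast<std::size_t>(n) * n * batch);
    assert(rhs.size() == static_cast<std::size_t>(n) * batch);
    std::vector<int> piv(n);
    for (int i = 0; i < batch; ++i) {
        double* a = &mats[static_cast<std::size_t>(i) * n * n];
        double* b = &rhs[static_cast<std::size_t>(i) * n];
        if (!lu_factor(a, piv.data(), n)) return false;
        lu_solve(a, piv.data(), n, b);
    }
    return true;
}
```

After solve_batched returns true, rhs holds the solutions and mats holds the LU factors, so further right-hand sides can reuse the factorization through lu_solve alone.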

For matrix inversion: I’m using the batched version of cgemm to estimate many covariance matrices, and then I need to invert all of them. A float/complex solver could also be useful.

You’re right, it’s just that I don’t have enough time right now to do it :) Thanks for the tip, I didn’t know that getrf was now batched. It could be useful in the future!

@CudaaduC MAGMA and CULA work on only one matrix at a time. My matrices are too small (50x50 to 100x100) to see any GPU gain with these functions.

Also, if I use streams with the non-batched function, will it work?

I tried streams + non-batched cgemm and batched cgemm, and I get different results depending on the GPU and the precision used.

Example (Tesla K20c), 1000 64x64 complex matrix estimations, with 121 estimates for each matrix:

    streams(32bit) = 597.436072 GFLOPS
    batched(32bit) = 605.989044 GFLOPS

    streams(64bit) = 546.696069 GFLOPS
    batched(64bit) = 237.250216 GFLOPS
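
For readers who want to reproduce figures like these, the sketch below shows one common way to turn a measured time into GFLOPS for batched complex matrix products. It rests on assumptions not stated in the post: that each estimate is a product of a 64 x 121 matrix with a 121 x 64 matrix, and that a complex multiply-add is counted as 8 real floating-point operations. The function names are made up for illustration.

```cpp
#include <cassert>

// Real flops for `batch` products of an (m x k) complex matrix with a
// (k x n) complex matrix, counting one complex multiply-add as 8 real
// flops (an assumed convention; 6 for the multiply plus 2 for the add).
double complex_gemm_flops(double m, double n, double k, double batch) {
    assert(m > 0 && n > 0 && k > 0 && batch > 0);
    return 8.0 * m * n * k * batch;
}

// Convert an operation count and an elapsed time in seconds to GFLOPS.
double gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}
```

Under these assumptions the whole batch amounts to complex_gemm_flops(64, 64, 121, 1000), about 3.96e9 operations, so the measured time, not the operation count, is what differs between the streams and batched runs.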

Release 1.1 of the batched solver / matrix inverse code has been posted. See this announcement for download instructions:

What good news!!

Thanks :)

Question: for single precision, is the number of operations (for one matrix) equal to 2/3 N^3?