cublas solving a linear system

Hi,

I have a 1-D array, say “x”, with 441x1 elements. I also have two more inputs: a matrix “A” (3x3) and a vector “b” (3x1), which are initialized a priori.

Now I need to solve Ax = b for different segments of “x” (each of size 3x1) in parallel using cuBLAS. All the variables “A”, “b” and “x” are of the double data type.

That is, I would like to compute the steps below in parallel:
A x(1 to 3) = b
A x(4 to 6) = b
A x(7 to 9) = b

Could you please suggest how this can be done using cublas calls?

Thanks,
Sreeram

This doesn’t make a lot of sense to me. If A and b are defined, why would you expect more than one solution of the linear system Ax=b?

There is normally only one set of values for the vector x which will satisfy the linear system, barring degenerate cases.

Is there any particular reason this has to be done in cublas?

A similar problem that I have seen is solving Ax=b for a number of different b vectors:

Ax1 = b1
Ax2 = b2
Ax3 = b3

In this case, the solution vectors x1, x2, x3 may indeed be different. Such a problem/request would make sense to me.

If you are looking for something like that, cuSOLVER has a worked example. The example only covers a single b vector (a so-called “right-hand-side vector”, i.e. nrhs=1 in that example), but it should be trivial to extend it to the case where nrhs>1:

http://docs.nvidia.com/cuda/cusolver/index.html#ormqr-example1
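For reference, here is a minimal sketch of that extension, following the same geqrf + ormqr + trsm pattern as the linked example but with nrhs columns in B. The function name and variable names are placeholders, matrices are assumed column-major, and all error checking and devInfo inspection are omitted:

```c
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuda_runtime.h>

/* Sketch: solve A * X = B by QR factorization for nrhs right-hand sides at once.
   h_A is m x m, h_B is m x nrhs, both column-major; error checks omitted. */
void qr_solve(const double *h_A, double *h_B, int m, int nrhs)
{
    cusolverDnHandle_t cusolverH;  cusolverDnCreate(&cusolverH);
    cublasHandle_t     cublasH;    cublasCreate(&cublasH);

    double *d_A, *d_B, *d_tau, *d_work;
    int *devInfo;
    cudaMalloc(&d_A,   sizeof(double) * m * m);
    cudaMalloc(&d_B,   sizeof(double) * m * nrhs);
    cudaMalloc(&d_tau, sizeof(double) * m);
    cudaMalloc(&devInfo, sizeof(int));
    cudaMemcpy(d_A, h_A, sizeof(double) * m * m,    cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof(double) * m * nrhs, cudaMemcpyHostToDevice);

    /* workspace: take the larger of the geqrf and ormqr requirements */
    int lwork_geqrf = 0, lwork_ormqr = 0;
    cusolverDnDgeqrf_bufferSize(cusolverH, m, m, d_A, m, &lwork_geqrf);
    cusolverDnDormqr_bufferSize(cusolverH, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
                                m, nrhs, m, d_A, m, d_tau, d_B, m, &lwork_ormqr);
    int lwork = lwork_geqrf > lwork_ormqr ? lwork_geqrf : lwork_ormqr;
    cudaMalloc(&d_work, sizeof(double) * lwork);

    /* A = Q * R (in place); Householder scalars go to d_tau */
    cusolverDnDgeqrf(cusolverH, m, m, d_A, m, d_tau, d_work, lwork, devInfo);

    /* B := Q^T * B -- the 'n' argument is nrhs, so all columns are handled at once */
    cusolverDnDormqr(cusolverH, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
                     m, nrhs, m, d_A, m, d_tau, d_B, m, d_work, lwork, devInfo);

    /* X := R^{-1} * B -- triangular solve over all nrhs columns */
    const double one = 1.0;
    cublasDtrsm(cublasH, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                m, nrhs, &one, d_A, m, d_B, m);

    cudaMemcpy(h_B, d_B, sizeof(double) * m * nrhs, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_tau); cudaFree(d_work); cudaFree(devInfo);
    cublasDestroy(cublasH); cusolverDnDestroy(cusolverH);
}
```

The only nrhs-dependent changes relative to the single-vector example are the width of B and the ‘n’ argument passed to the ormqr and trsm calls.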

This use case does not look like a good fit for CUBLAS, or indeed GPUs. If I understand correctly, you are solving 147 different 3x3 systems, and what you call the vector ‘x’ is actually 147 different right-hand sides ‘b’, while what you need to compute are the 147 solution vectors ‘x’. Given how tiny the systems are, you should get the best performance with a simple custom solver, with each thread solving a 3x3 system.
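For illustration, a minimal sketch of what such a per-thread solver could look like, assuming a single shared 3x3 matrix A (row-major) and the 147 right-hand sides packed back to back; each thread applies Cramer's rule to its own system. The kernel name and memory layout are assumptions, not from the original post:

```c
#include <cuda_runtime.h>

/* Sketch: one thread per 3x3 system, solved with Cramer's rule.
   A is a single 3x3 matrix (row-major, 9 doubles), b holds nsys
   right-hand sides packed back to back, x receives the nsys solutions. */
__global__ void solve3x3(const double *A, const double *b, double *x, int nsys)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nsys) return;

    const double *bi = b + 3 * i;   /* this thread's right-hand side */
    double       *xi = x + 3 * i;   /* this thread's solution        */

    double a11 = A[0], a12 = A[1], a13 = A[2];
    double a21 = A[3], a22 = A[4], a23 = A[5];
    double a31 = A[6], a32 = A[7], a33 = A[8];

    /* determinant of A (assumed non-singular) */
    double det = a11 * (a22 * a33 - a23 * a32)
               - a12 * (a21 * a33 - a23 * a31)
               + a13 * (a21 * a32 - a22 * a31);

    /* Cramer's rule: determinant of A with one column replaced by b, over det(A) */
    xi[0] = (bi[0] * (a22 * a33 - a23 * a32)
           - a12   * (bi[1] * a33 - a23 * bi[2])
           + a13   * (bi[1] * a32 - a22 * bi[2])) / det;
    xi[1] = (a11   * (bi[1] * a33 - a23 * bi[2])
           - bi[0] * (a21 * a33 - a23 * a31)
           + a13   * (a21 * bi[2] - bi[1] * a31)) / det;
    xi[2] = (a11   * (a22 * bi[2] - bi[1] * a32)
           - a12   * (a21 * bi[2] - bi[1] * a31)
           + bi[0] * (a21 * a32 - a22 * a31)) / det;
}
```

A launch such as solve3x3<<<(147 + 127) / 128, 128>>>(d_A, d_b, d_x, 147); would then produce all 147 solution vectors in a single kernel call.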

But the parallelism of this is way too low, since only 147 threads would be used, while a GPU needs on the order of 10,000 threads to reach full performance. So unless this computation is part of a bigger processing pipeline that has already been ported to the GPU, you would be better off running this on the CPU, where this little data easily fits in the L1/L2 caches. Again, you would probably also want to use a custom solver rather than a library such as MKL, as the library overhead would be substantial given the tiny matrices.

Hi all,

Thanks for your inputs. I did not state the problem correctly; apologies for that. Please see the problem I face below:

I have two input arrays in 1-D: 1) “A” of size 67284x1, 2) “b” of size 11214x1.

I need to solve in parallel:

A(1:36) x1 = b(1:6) and find x1 (of size 6x1)

A(37:72) x2 = b(7:12) and find x2 (of size 6x1)

…

A(67249:67284) x1869 = b(11209:11214) and find x1869 (of size 6x1)

Could you please suggest how to solve this in parallel?

  1. There was a suggestion about writing a custom kernel. Will that not be time-consuming?

  2. Even if I write a custom kernel, x1 = inverse(A1)*b1 needs to be computed. How do I find the inverse of a 6x6 matrix?

  3. Will the batched solver help in this case? Could anyone please advise?

Thanks.

So you basically want to solve:

A1x1=b1
A2x2=b2
A3x3=b3

in parallel.

There is batched solver code available for download to registered developers. I had difficulty finding it by navigating the site, but the direct link still seems to work:

https://developer.nvidia.com/rdp/assets/cuda-batched-solver-tgz

You’ll need to log in with your registered developer credentials before that link will work, I believe.
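Alternatively, cuBLAS itself exposes batched LU routines (cublasDgetrfBatched / cublasDgetrsBatched) that fit this pattern of many tiny independent systems, and they also answer the earlier question about inversion: no explicit inverse of the 6x6 matrices is needed, since each one is LU-factorized and solved directly. A rough sketch, assuming all 1869 matrices and right-hand sides are packed contiguously in device memory, with error checking omitted and all names being placeholders:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch: solve 'batch' independent 6x6 systems A_i * x_i = b_i with the
   batched LU routines in cuBLAS.  d_A holds all matrices back to back
   (batch * 36 doubles, each 6x6 column-major), d_b holds all right-hand
   sides (batch * 6 doubles) and is overwritten with the solutions. */
void batched_solve_6x6(double *d_A, double *d_b, int batch)
{
    const int n = 6, nrhs = 1;
    cublasHandle_t handle;
    cublasCreate(&handle);

    /* getrf/getrsBatched expect device arrays of device pointers,
       one pointer per matrix / right-hand side in the batch. */
    double **h_Aptr = (double **)malloc(batch * sizeof(double *));
    double **h_bptr = (double **)malloc(batch * sizeof(double *));
    for (int i = 0; i < batch; ++i) {
        h_Aptr[i] = d_A + i * n * n;
        h_bptr[i] = d_b + i * n;
    }
    double **d_Aptr, **d_bptr;
    cudaMalloc(&d_Aptr, batch * sizeof(double *));
    cudaMalloc(&d_bptr, batch * sizeof(double *));
    cudaMemcpy(d_Aptr, h_Aptr, batch * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_bptr, h_bptr, batch * sizeof(double *), cudaMemcpyHostToDevice);

    int *d_pivots, *d_info, h_info = 0;
    cudaMalloc(&d_pivots, batch * n * sizeof(int));
    cudaMalloc(&d_info,   batch * sizeof(int));

    /* LU-factorize every A_i in place (no explicit inverses needed) */
    cublasDgetrfBatched(handle, n, d_Aptr, n, d_pivots, d_info, batch);

    /* Solve every system; each b_i is overwritten with x_i.
       Note: the 'info' argument of getrsBatched is a host pointer. */
    cublasDgetrsBatched(handle, CUBLAS_OP_N, n, nrhs,
                        (const double * const *)d_Aptr, n, d_pivots,
                        d_bptr, n, &h_info, batch);

    cudaFree(d_pivots); cudaFree(d_info);
    cudaFree(d_Aptr); cudaFree(d_bptr);
    free(h_Aptr); free(h_bptr);
    cublasDestroy(handle);
}
```

With “A” packed as 1869 consecutive 6x6 column-major blocks and “b” as 1869 consecutive 6x1 vectors, calling batched_solve_6x6(d_A, d_b, 1869) overwrites each segment of b with the corresponding solution x.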