Solving a Linear System or a Least-Squares Problem on Multiple GPUs

Hello,

I am currently developing an acoustic solver. For a small problem I can solve the system on a single GPU. However, for a large problem, the dense matrix generated cannot fit in the memory of a single GPU, so I need to distribute the linear system across multiple GPUs. I am wondering whether the NVIDIA solver is capable of solving a system distributed across multiple GPUs. Thanks.

Best,
Ziqi

This is outside my area of expertise, but I note that the documentation for cuSOLVER (https://docs.nvidia.com/cuda/cusolver/index.html) does not mention multi-GPU operation. I am curious: how large are these systems that they don’t fit into the up to 32 GB of memory on modern GPUs? The literature seems to indicate that this is not an uncommon problem, e.g.

Manuel A. Diaz, Maxim A. Solovchuk, and Tony W. H. Sheu, “High-performance multi-GPU solver for describing nonlinear acoustic waves in homogeneous thermoviscous media.” Computers & Fluids, Vol. 173, 15 September 2018, pp. 195-205
“A double-precision numerical solver to describe the propagation of high-intensity ultrasound fluctuations using a novel finite-amplitude compressible acoustic model working in multiple processing units (GPUs) is presented. […] The present multi-GPU implementation aims to make the best use of every single GPU and gain optimal performance of the algorithm on the per-node basis. To assess the performance of the present solver, a typical mini-server computer with 4 Tesla K80 dual GPU accelerators is used.”

First, thank you for reminding me that the largest GPU memory is 32 GB.

32 GB is adequate for our problem, but I have to say it is only adequate either for a small setting with a high frequency or for a large setting with a low frequency. If we wanted to simulate sound phenomena at 20 kHz in an airplane, 32 GB would be far from enough. This is what has driven me to scale my solution to problems of any size.

So I take it these use cases typically involve dense matrices upward of 40K × 40K?

When I mentioned GPUs with 32 GB of memory I was thinking of the Quadro GV100. It seems NVIDIA has since announced a Quadro RTX 8000 with 48 GB, but as a Turing-based design it obviously has low double-precision throughput.

Yes. For instance, if we want to simulate sound phenomena related to a human body up to 20 kHz, we need a mesh of around 90,000 triangles, so the matrix size is 90K × 90K in such a case. If we want to simulate acoustic phenomena within an airplane, still up to 20 kHz, the mesh is much larger than that of a human body, let alone a rocket or a ship. Thus, to scale our solver into one capable of simulating all such acoustic phenomena, it has to handle the multi-GPU case.
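To put that in numbers, assuming double-precision entries, storing a 90K × 90K dense matrix alone takes 90,000² × 8 bytes ≈ 65 GB (twice that for double-complex entries), which is already more than any single GPU offers.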

Thanks for the explanation, this is interesting to me as the first implementer of CUBLAS (2005-2008). There are plenty of use cases involving big sparse matrices, but I had not encountered one involving huge dense matrices.

The literature seems to indicate that researchers build their own multi-GPU capable solvers for use cases like yours. Let’s see whether someone can provide a pointer to a “standard” multi-GPU capable solver, from either NVIDIA or a third party.
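To give a flavor of what “building your own” means at the lowest level: each GPU holds one panel of the matrix, and the solver layers per-device BLAS calls plus inter-device transfers on top of that. Here is a rough, untested sketch of a 1-D block-row partition using plain CUDA runtime calls (the dimension and layout are purely illustrative):

```cpp
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Sketch: split an n x n dense matrix into contiguous block rows,
// one panel per GPU. A real distributed LU/QR solver would add panel
// factorization and inter-GPU communication on top of this layout.
int main() {
    const size_t n = 90000;   // global matrix dimension (example value)
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 1;

    size_t rowsPerDev = (n + ndev - 1) / ndev;   // ceiling division
    std::vector<double*> panels(ndev, nullptr);

    for (int d = 0; d < ndev; ++d) {
        size_t r0 = (size_t)d * rowsPerDev;
        if (r0 >= n) break;
        size_t rows = std::min(rowsPerDev, n - r0);
        cudaSetDevice(d);
        // Each panel stores 'rows' full rows of the global matrix;
        // the panel itself must still fit on its own device.
        if (cudaMalloc(&panels[d], rows * n * sizeof(double)) != cudaSuccess)
            return 1;
        // ... fill the panel, run per-device cuBLAS on it, and exchange
        //     data between devices as the algorithm requires ...
    }

    for (int d = 0; d < ndev; ++d)
        if (panels[d]) { cudaSetDevice(d); cudaFree(panels[d]); }
    return 0;
}
```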

Please accept my salutations. I implemented several QR solvers myself, but none of them is comparable to the solver from NVIDIA. You did a really great job.

As I said, I only created the initial (incomplete, only partially tuned) implementation of CUBLAS and later on provided some initial ideas for batched operations on small matrices.

Several other more capable people have since expanded the meager starting points I provided into the highly-tuned linear algebra libraries (plural) provided by NVIDIA; they deserve full credit for what you are using today. I retired in 2014.

I have a question about NVLink. When multiple GPUs are linked through NVLink, are they viewed as a single GPU? For instance, my code runs well on a single GPU with 8 GB of memory. If four such GPUs are linked through NVLink, can I run the same code on the linked GPUs with a much larger model that fits in 32 GB?

No, they are not viewed as a single GPU. You must use multi-GPU programming methods whether the GPUs are connected via NVLink or not.
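NVLink speeds up transfers between GPUs and enables peer access, but each device keeps its own memory and address space. A minimal, untested sketch of what the runtime actually exposes:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    printf("%d separate devices visible, each with its own memory\n", ndev);

    // Peer access must be enabled explicitly for each device pair;
    // NVLink only makes the resulting transfers faster.
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < ndev; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess) cudaDeviceEnablePeerAccess(j, 0);
        }
    }
    // Even then, data must be partitioned per device and moved
    // explicitly, e.g. with cudaMemcpyPeer between device buffers.
    return 0;
}
```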

Then is there any GPU that can be viewed as a single device with 24 GB or more of memory?

If so, could you list the specific GPU models that meet my needs?

For instance, the Tesla V100 has 16 GB of memory. Can I assume that I can write my CUDA code in single-GPU mode if my problem fits within 16 GB?

The Tesla V100 that is currently for sale has 32 GB of memory. The Quadro RTX 8000 has 48 GB of memory.

Yes, if your problem fits in GPU memory, you should be able to run it on a single GPU.
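If you want to check that at runtime rather than from spec sheets, you can query the device’s free memory. A small sketch; the helper name and the 10% workspace headroom are just my own choices:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does an n x n dense double matrix, plus a
// ~10% margin for solver workspace, fit in this GPU's free memory?
bool fitsOnDevice(size_t n) {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    size_t need = n * n * sizeof(double);
    return need + need / 10 <= freeB;
}

int main() {
    // A 40K x 40K double matrix needs about 12.8 GB.
    printf("40000 x 40000 fits: %s\n", fitsOnDevice(40000) ? "yes" : "no");
    return 0;
}
```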