Solving a Linear System or a Least-Squares Problem on Multiple GPUs

Hello,

I am currently developing an acoustic solver. For a small problem I can solve the system on a single GPU. However, for a large problem, the dense matrix generated cannot fit in the memory of a single GPU, so I need to distribute the linear system across multiple GPUs. I am wondering whether the NVIDIA solver is capable of solving a system distributed across multiple GPUs. Thanks.

Best,
Ziqi

This is outside my area of expertise, but I note that the documentation for cuSOLVER (https://docs.nvidia.com/cuda/cusolver/index.html) does not mention multi-GPU operation. I am curious: how large are these systems that they don’t fit into the up to 32 GB of memory on modern GPUs? The literature seems to indicate that this is not an uncommon problem, e.g.

Manuel A. Diaz, Maxim A. Solovchuk, and Tony W. H. Sheu, “High-performance multi-GPU solver for describing nonlinear acoustic waves in homogeneous thermoviscous media.” Computers & Fluids, Vol. 173, 15 September 2018, pp. 195-205
“A double-precision numerical solver to describe the propagation of high-intensity ultrasound fluctuations using a novel finite-amplitude compressible acoustic model working in multiple processing units (GPUs) is presented. […] The present multi-GPU implementation aims to make the best use of every single GPU and gain optimal performance of the algorithm on the per-node basis. To assess the performance of the present solver, a typical mini-server computer with 4 Tesla K80 dual GPU accelerators is used.”

First, thank you for reminding me that the largest GPU memory is 32 GB.

32 GB is adequate for our problem, but I have to say it is only adequate either for a small setting with a high frequency or for a large setting with a low frequency. If we wanted to simulate sound phenomena at 20 kHz in an airplane, 32 GB would be far from enough. This is what has driven me to scale my solution to problems of any size.

So I take it these use cases typically involve dense matrices upward of 40K × 40K?

When I mentioned GPUs with 32 GB of memory I was thinking of the Quadro GV100. It seems NVIDIA has since announced a Quadro RTX 8000 with 48 GB, but as a Turing-based design it obviously has low double-precision throughput.

Yes. For instance, if we want to simulate sound phenomena related to a human body up to 20 kHz, we need a mesh of around 90,000 triangles, so the matrix size is 90K × 90K in such a case. If we want to simulate acoustic phenomena within an airplane, still up to 20 kHz, the mesh is much larger than that of a human body, let alone a rocket or a ship. Thus, to scale our solver into one capable of simulating all such acoustic phenomena, it has to handle the multi-GPU case.
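To put that in numbers, assuming double-precision entries, storing a 90K × 90K dense matrix alone takes 90,000² × 8 bytes ≈ 65 GB (twice that for double-complex entries), which is already more than any single GPU offers.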

Thanks for the explanation, this is interesting to me as the first implementer of CUBLAS (2005-2008). There are plenty of use cases involving big sparse matrices, but I had not encountered one involving huge dense matrices.

The literature seems to indicate that researchers build their own multi-GPU capable solvers for use cases like yours. Let’s see whether someone can provide a pointer to a “standard” multi-GPU capable solver, from either NVIDIA or a third party.
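To give a flavor of what “building your own” means at the lowest level: each GPU holds one panel of the matrix, and the solver layers per-device BLAS calls plus inter-device transfers on top of that. Here is a rough, untested sketch of a 1-D block-row partition using plain CUDA runtime calls (the dimension and layout are purely illustrative):

```cpp
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Sketch: split an n x n dense matrix into contiguous block rows,
// one panel per GPU. A real distributed LU/QR solver would add panel
// factorization and inter-GPU communication on top of this layout.
int main() {
    const size_t n = 90000;   // global matrix dimension (example value)
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 1;

    size_t rowsPerDev = (n + ndev - 1) / ndev;   // ceiling division
    std::vector<double*> panels(ndev, nullptr);

    for (int d = 0; d < ndev; ++d) {
        size_t r0 = (size_t)d * rowsPerDev;
        if (r0 >= n) break;
        size_t rows = std::min(rowsPerDev, n - r0);
        cudaSetDevice(d);
        // Each panel stores 'rows' full rows of the global matrix;
        // the panel itself must still fit on its own device.
        if (cudaMalloc(&panels[d], rows * n * sizeof(double)) != cudaSuccess)
            return 1;
        // ... fill the panel, run per-device cuBLAS on it, and exchange
        //     data between devices as the algorithm requires ...
    }

    for (int d = 0; d < ndev; ++d)
        if (panels[d]) { cudaSetDevice(d); cudaFree(panels[d]); }
    return 0;
}
```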

Please accept my salutations. I implemented several QR solvers myself, but none of them is comparable to the solver from NVIDIA. You did a really great job.

As I said, I only created the initial (incomplete, only partially tuned) implementation of CUBLAS and later on provided some initial ideas for batched operations on small matrices.

Several other more capable people have since expanded the meager starting points I provided into the highly-tuned linear algebra libraries (plural) provided by NVIDIA; they deserve full credit for what you are using today. I retired in 2014.

I have a question about NVLink. When multiple GPUs are linked through NVLink, are they viewed as a single GPU? For instance, my code runs well on a single GPU with 8 GB of memory. If four such GPUs are linked through NVLink, can I run the same code on the linked GPUs with a much larger model that fits in 32 GB?

No, they are not viewed as a single GPU. You must use multi-GPU programming methods whether the GPUs are connected via NVLink or not.
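NVLink speeds up transfers between GPUs and enables peer access, but each device keeps its own memory and address space. A minimal, untested sketch of what the runtime actually exposes:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    printf("%d separate devices visible, each with its own memory\n", ndev);

    // Peer access must be enabled explicitly for each device pair;
    // NVLink only makes the resulting transfers faster.
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < ndev; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess) cudaDeviceEnablePeerAccess(j, 0);
        }
    }
    // Even then, data must be partitioned per device and moved
    // explicitly, e.g. with cudaMemcpyPeer between device buffers.
    return 0;
}
```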

Then is there any GPU that can be viewed as a single device with 24 GB or more of memory?

If so, could you list the specific GPU models that meet my needs?

For instance, the Tesla V100 has 16 GB of memory. Can I assume that I can write my CUDA code in single-GPU mode if my problem fits within 16 GB?

The Tesla V100 that is currently for sale has 32 GB of memory. The Quadro RTX 8000 has 48 GB of memory.

Yes, if your problem fits in GPU memory, you should be able to run it on a single GPU.
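If you want to check that at runtime rather than from spec sheets, you can query the device’s free memory. A small sketch; the helper name and the 10% workspace headroom are just my own choices:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does an n x n dense double matrix, plus a
// ~10% margin for solver workspace, fit in this GPU's free memory?
bool fitsOnDevice(size_t n) {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    size_t need = n * n * sizeof(double);
    return need + need / 10 <= freeB;
}

int main() {
    // A 40K x 40K double matrix needs about 12.8 GB.
    printf("40000 x 40000 fits: %s\n", fitsOnDevice(40000) ? "yes" : "no");
    return 0;
}
```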