Hi,
This could be a feature request, really. When the factors of a matrix cannot fit into the GPU's memory, the hybrid mode becomes handy, and some of the factors go to host memory. This adds some communication overhead during factorization, but that is not a problem. The problem is the solve phase: it takes the biggest hit in terms of performance, and I may need to reuse those factors, say, 100 different times.
My question and request would be: would it be possible to move all the factors to the host side when they cannot fully fit into GPU memory, and perform the solve phase on the CPU side to recover some performance? I'm aware of the MGMN mode too, but this is a single-GPU scenario.
Regards
Deniz
Hi Deniz!
Sorry for the delay, we were busy with cuDSS 0.7.0, which has finally arrived with lots of new features!
One argument against your idea is that the solve phase is often dominated by streaming the factors (including the symbolic information) through memory. So if the matrix is big and hybrid memory mode is needed, the factors will be in host memory, but the symbolic data would currently remain on the device.
Then the solve time on the CPU would be bound by CPU memory bandwidth, while a hybrid memory mode solve is bound by the CPU-to-GPU transfer plus the (possibly overlapped) GPU memory bandwidth. Which one wins depends on the system, especially on the type of memory interconnect.
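To make the tradeoff concrete, here is a rough back-of-envelope estimate of the time to stream the factors once during a single solve. All numbers below (factor size, bandwidths) are illustrative assumptions, not measurements from any particular system:

```python
# Back-of-envelope: lower bound on one solve, assuming it is dominated by
# streaming the L/U factors through memory exactly once.
# All figures are illustrative assumptions.

factor_bytes = 40e9   # assumed factor size: 40 GB (larger than GPU memory)

cpu_mem_bw = 80e9     # assumed host DRAM bandwidth: 80 GB/s
pcie_bw    = 6e9      # assumed host-to-device link: 6 GB/s (as in the thread)

# CPU-side solve: factors stay in host memory, bound by host DRAM bandwidth.
t_cpu_solve = factor_bytes / cpu_mem_bw

# Hybrid-mode solve: factors must be pushed over the PCIe link to the GPU.
t_hybrid_xfer = factor_bytes / pcie_bw

print(f"CPU-side solve lower bound:      {t_cpu_solve:.1f} s")
print(f"Hybrid-mode transfer lower bound: {t_hybrid_xfer:.1f} s")
```

Under these assumed numbers the CPU-side solve wins by roughly the ratio of DRAM bandwidth to link bandwidth; on a system with NVLink or a much faster interconnect the conclusion could flip.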
In general, the answer is yes, what you describe can be a viable optimization direction. Would you be interested in providing a motivating use case? We can of course evaluate the potential effect of this optimization on our internal data, but it is always helpful to have a customer application to speed up.
Thanks,
Kirill
Hi Kirill,
Yes, I saw the 0.7 update 2 days ago, and it could be the biggest one yet. I wanted the separate forward and backward solves; I can use the Schur complement mode too. Multi-GPU mode without the need for NCCL is nice (for the Windows case). I'm happy that deterministic mode is implemented, and that the developers listen (or already know what people want).
My motivation: let me describe the home-made workstation I have. It is a high-end consumer-grade PC with 8 gaming-grade cards attached, connected through PCIe and NVMe slots (a cheap alternative to server-grade options). So the link bandwidth between my cards and the host is not great, though not terrible either (3 GB/s or 6 GB/s), but I think the latency between them is causing issues. I suspect this because the factorization phase takes only a small hit when it has to store the factors in host memory, while the solve phase takes an enormous hit when it needs to access those factors. I also saw this "synchronization issue" when I used multiple GPUs (say, 2) to run AI models locally. This makes me think that if I used multiple GPUs to factor a matrix, I would run into the same problem again in the solve phase.
The reason I made such a request is this: I'm a geophysicist, and I'm modeling EM waves in the frequency domain. So if I have, say, 30 different frequencies, I need to solve 30 different Ax=b systems (sparse, complex, and symmetric) to obtain the earth's response. I can solve those linear systems iteratively, and that is what I was doing until cuDSS, but if I can factorize all the matrices and hold all of the factors at the same time, I can use them to solve an inverse/optimization problem very quickly. (But I need to reuse those factors maybe 100 or 200 times for different b vectors. Furthermore, with multiple sets of factors, I will need a plan for which factors stay on the GPU side and which stay on the CPU side.)
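The factor-once, solve-many pattern described above can be sketched on the CPU side with SciPy's sparse LU. This is an analogy only, not the cuDSS API; the tridiagonal complex symmetric matrix below is a small hypothetical stand-in for one frequency's system:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Small stand-in for one frequency's sparse, complex, symmetric system A x = b.
n = 200
rng = np.random.default_rng(0)
diag = 4.0 + 1j * rng.uniform(0.5, 1.5, n)  # complex diagonal entries
off = np.full(n - 1, -1.0)                  # equal off-diagonals -> symmetric
A = sp.diags([off, diag, off], offsets=[-1, 0, 1],
             format="csc", dtype=np.complex128)

lu = splu(A)  # factorize once (analysis + numeric factorization)

# Reuse the cached factors for many right-hand sides, as in the inversion loop.
for k in range(100):
    b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    x = lu.solve(b)  # cheap triangular solves against the stored factors
    assert np.allclose(A @ x, b)
```

The expensive factorization happens once per frequency; each subsequent b vector costs only the forward/backward triangular solves, which is exactly why keeping the factors resident (wherever they fit) pays off.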
I know that if the factors stay only on the host side, the solve phase on the CPU will be slower because it is memory-bandwidth limited, but I believe it is still a better option than the GPU trying to access the factors on the host side (at least for my case).
Regards
Deniz