Double buffer requirement for SpSV and SpSM operations

Hi,

I have a basic preconditioned conjugate gradient (CG) routine that solves the linear system Ax = b with a preconditioner matrix M. The preconditioner is factored into two triangular matrices by ILU(0). Then, in each CG iteration, those triangular matrices are applied to a vector via the SpSV or SpSM functions.

My question, or rather my feature request, is this: during the analysis phase of SpSV, I have to allocate two buffers, one for the forward sweep and one for the backward sweep. I've realized I cannot use a single buffer for both operations. I think I'm right on this one.
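
For reference, my setup looks roughly like this (a simplified sketch using the cuSPARSE generic API; descriptor and buffer names like `matL`, `dBufL` are placeholders, and error checking is omitted):

```c
#include <cusparse.h>

// Assumed to be created elsewhere: handle, the CSR descriptors for the
// ILU(0) factors (matL, matU), and the dense vector descriptors.
cusparseHandle_t      handle;
cusparseSpMatDescr_t  matL, matU;        // lower/upper triangular factors
cusparseDnVecDescr_t  vecX, vecY, vecZ;  // r, intermediate, M^{-1} r
cusparseSpSVDescr_t   spsvDescrL, spsvDescrU;
double alpha = 1.0;
size_t bufSizeL, bufSizeU;
void  *dBufL, *dBufU;

cusparseSpSV_createDescr(&spsvDescrL);
cusparseSpSV_createDescr(&spsvDescrU);

// Forward sweep (L y = x): query, allocate, analyze.
cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                        matL, vecX, vecY, CUDA_R_64F,
                        CUSPARSE_SPSV_ALG_DEFAULT, spsvDescrL, &bufSizeL);
cudaMalloc(&dBufL, bufSizeL);
cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                      matL, vecX, vecY, CUDA_R_64F,
                      CUSPARSE_SPSV_ALG_DEFAULT, spsvDescrL, dBufL);

// Backward sweep (U z = y): a second buffer is needed, because each
// analysis stores its metadata in the buffer it was given, and that
// buffer must stay alive for every later SpSV_solve on that descriptor.
cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                        matU, vecY, vecZ, CUDA_R_64F,
                        CUSPARSE_SPSV_ALG_DEFAULT, spsvDescrU, &bufSizeU);
cudaMalloc(&dBufU, bufSizeU);  // cannot reuse dBufL here
cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                      matU, vecY, vecZ, CUDA_R_64F,
                      CUSPARSE_SPSV_ALG_DEFAULT, spsvDescrU, dBufU);

// Then, in every CG iteration:
cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                   matL, vecX, vecY, CUDA_R_64F,
                   CUSPARSE_SPSV_ALG_DEFAULT, spsvDescrL);
cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                   matU, vecY, vecZ, CUDA_R_64F,
                   CUSPARSE_SPSV_ALG_DEFAULT, spsvDescrU);
```

Since both buffers must remain allocated for the lifetime of the solver loop, their sizes add up rather than overlap.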

The problem is this: for small matrices the buffer size is negligible, but let me give you some numbers. I have a matrix A in CSR format that consumes around 4.5 GB of VRAM, and a preconditioner matrix that consumes ~2 GB of VRAM. It turns out that each of those two buffers allocates 2.4 GB of memory, so 4.8 GB in total is tied up just for the sparse triangular solves. That is a lot of memory, and because I cannot share a single buffer, I cannot save at least 2.4 GB of it.

In short, I wish I could share a single buffer between those sparse operations and save that memory in the future.

Regards

Deniz

Hi @Coercion ,

Yes, SpSV and SpSM require separate buffers for the forward and backward sweeps, and these buffers cannot be overlapped. We will keep this request in mind for future consideration. You mentioned using CG; could you please share more details about the overall application?

Thanks,

Mohammad

Hi again,

I’m a geophysicist who models electromagnetic waves. I use Maxwell’s equations in the frequency domain and solve a linear system Ax = b, where A is a complex symmetric but non-Hermitian matrix. I said CG, but I actually coded my own solver; strictly speaking, I’m using the GPBiCG algorithm.

The forward problem is coupled with an inverse problem, but I’ve already coded all of this in CUDA C, so I’m all good there. On the other hand, the immense memory requirement of the triangular solves surprised me, even though those triangular matrices come from ILU(0). Maybe it wasn’t like this in CUDA 11 or 12; I didn’t check.

Regards

Deniz