cuBLAS and very large matrices

I have managed to write a reasonably fast QR solver for the CPU, to use on very large matrices.

I know, I know, use BLAS and it will probably go even faster, but it’s a fairly simple algorithm, integrated with my other routines, and I’m probably within the ballpark of what CPU-BLAS can do…

But I’d like to surpass the CPU and just jump straight to the GPU for really high acceleration. The matrices I’m trying to decompose are upwards of 18 GB, which exceeds the memory of many cards. Does cuBLAS have any features that will allow it to perform its decompositions in stages when the matrix size or other memory requirements would exceed the card’s memory? Having coded the algorithm myself, I see how I might be able to copy blocks of the matrix up and down to do the Householder operations in stages.

Have you checked whether there is anything suitable in the CUBLAS XT API?

[url]https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublasXt-api[/url]
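One caveat with cublasXt: it covers Level-3 BLAS routines (GEMM and friends), not factorizations, but it accepts host-resident pointers and handles the tiling and host-to-device streaming itself, so the operands can exceed device memory. A QR would still need its panel factorizations and trailing-matrix updates built on top. A minimal sketch of the usage pattern (error checking omitted, size and tile dimension illustrative; requires a CUDA-capable GPU to run):

```c
/* Sketch: out-of-core GEMM via cublasXt. A, B, C live in host memory and
 * may be larger than device memory; the library streams tiles to GPU 0. */
#include <stdio.h>
#include <stdlib.h>
#include <cublasXt.h>

int main(void)
{
    const int n = 4096;                     /* illustrative size */
    const double alpha = 1.0, beta = 0.0;
    double *A = malloc(sizeof(double) * (size_t)n * n);  /* host-resident */
    double *B = malloc(sizeof(double) * (size_t)n * n);
    double *C = malloc(sizeof(double) * (size_t)n * n);
    for (long k = 0; k < (long)n * n; ++k) { A[k] = 1.0; B[k] = 1.0; C[k] = 0.0; }

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[1] = { 0 };                  /* use GPU 0 */
    cublasXtDeviceSelect(handle, 1, devices);
    cublasXtSetBlockDim(handle, 2048);       /* tile size streamed to the GPU */

    /* C = alpha*A*B + beta*C, all operands in host memory */
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    printf("C[0] = %g\n", C[0]);
    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```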

I recall that in the past there were multiple third-party efforts regarding out-of-core matrix operations with CUDA. Some were proprietary, while some others were presented at various GTCs. No particular project name comes to mind, though.

FWIW, GPUs with up to 48 GB of onboard storage are currently available (but may not be cheap :-)

Thanks, njuffa! (Omitting your real name because, I dunno, nobody seems to use them around these boards…)

The algorithm I had envisioned, upon closer inspection, would be no faster: I’d be doing a single multiply-add for every number I punted to the GPU via cudaMemcpy(), and likewise only one multiply-add per element pulled straight from the GPU’s global memory. I can see why it’s not easy to do these matrix problems on a GPU: they’re very bandwidth-intensive!

For now, I think my little CPU algorithm can cope. The problem is that I’m looking at matrices of 250,000 rows x 10,000 columns, up from 250,000 rows x 1000 columns. The cost scales as the square of the number of columns, so that’s roughly a hundred-fold increase. The original matrices took about 12 minutes to solve on a single 2.5 GHz CPU, so for my full-size problems I may be looking at upwards of a day. But that is still OK, as long as I know that the result is going to get delivered.

Come to think of it, CUBLAS-accelerated HPL (high-performance Linpack), used as a benchmark for supercomputers, is a highly-tuned distributed out-of-core solver, is it not? Not a QR solver, as best I know, but many of the same principles regarding panel handling might be applicable. At one time, NVIDIA made their HPL source code available; I’m not sure whether they still do.

I am curious what practical use case gives rise to dense systems of this size.