Strip-mining matmul to optimize host-to-GPU bandwidth for LLM computation

Matmul is an O(n³) operation that consumes and produces O(n²) data, so given sufficiently large matrices it should always be possible to make it compute-limited rather than bandwidth-limited. From my HPC days, I know that the traditional way to optimize matmul is to strip-mine (tile) the loops, generally with as many strip-mining levels as there are bandwidth bottlenecks (unless you do something more clever). This should let you strip-mine away every cache-level bandwidth bottleneck as well as the GPU memory bottleneck, and, for sufficiently large matrices, even the PCIe-bridge-to-CPU-memory bottleneck. A minimal sketch of one tiling level is below.
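
To be concrete, this is the kind of thing I mean by one level of strip-mining / loop tiling (plain C, row-major, with an illustrative tile size picked to fit some cache level):

```c
/* Minimal sketch of one strip-mining (tiling) level for C = A * B,
 * square n x n matrices, row-major storage, C assumed zero-initialized.
 * TILE is an illustrative tile size chosen to fit a cache level. */
#define TILE 64

void matmul_tiled(int n, const float *A, const float *B, float *C)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* Work on one TILE x TILE block; the inner loops reuse
                 * data that now stays resident in the cache. */
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```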

I run local LLMs in ollama, which uses the llama.cpp package, which in turn uses cuBLAS. However, when I use a model that doesn’t fit on my GPU card, GPU load drops precipitously. I am trying to trace where the bottleneck is, so I am wondering: does cuBLAS do full loop strip-mining of matmul (and of the other operations used in LLMs)? In theory, with the large matrices used in a 70B model, it should be possible to fetch data all the way from CPU memory without losing compute performance, but is cuBLAS optimized enough to do this? It would also require bank-switching (double-buffering) weights from CPU memory into GPU memory while the results of the previous layer are being computed. Is this possible on a reasonably modern (Ampere) Nvidia GPU, given that enough total bandwidth is available, or is the memory bus locked while computing?

No, ordinary cuBLAS (for example cublasSgemm) works in a fashion similar to industry-standard BLAS libraries such as MKL: the entire matrix/data set has to fit in device memory.

It is possible to chunk up a matrix and do the optimizations you refer to, but “ordinary” cuBLAS doesn’t do that for you. cublasXt has some ability to do this automatically.
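
As a rough sketch, a cuBLASXt call looks like this. A, B and C are ordinary host allocations; cublasXt tiles them, streams the tiles to the GPU and overlaps the transfers with the tile GEMMs. The block dimension of 2048 is just an illustrative value to tune, and error checking is omitted:

```c
#include <cublasXt.h>

void xt_gemm(size_t m, size_t n, size_t k,
             const float *A, const float *B, float *C)
{
    cublasXtHandle_t handle;
    int devices[1] = {0};               /* use GPU 0 */
    float alpha = 1.0f, beta = 0.0f;

    cublasXtCreate(&handle);
    cublasXtDeviceSelect(handle, 1, devices);
    cublasXtSetBlockDim(handle, 2048);  /* tile edge length; tune for your card */

    /* Column-major GEMM on host pointers: C = alpha*A*B + beta*C */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k, &alpha, A, m, B, k, &beta, C, m);

    cublasXtDestroy(handle);
}
```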

A CUDA GPU can do several types of operations concurrently, including running GPU code while transferring data both to and from device memory. cublasXt will do this, and of course you can do it yourself, interleaving the copy calls with calls to cublasSgemm or similar that operate on chunks of data.
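
If you roll it yourself, a double-buffered arrangement along these lines is one way to overlap the transfers with the GEMMs. This is only a sketch: the names (CHUNK, chunked_gemm, dB, dC) are illustrative, hB and hC are assumed to be pinned host allocations (overlap requires pinned memory), and error checking is omitted:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define CHUNK 4096   /* columns of B/C processed per step; tune for your GPU */

/* C = A * B, column-major. A is resident on the device; B and C are
 * streamed to/from pinned host memory in column chunks. While the GEMM
 * for chunk i runs in one stream, the copy for chunk i+1 proceeds in
 * the other stream, so transfers overlap compute. */
void chunked_gemm(cublasHandle_t handle, int m, int n, int k,
                  const float *dA,   /* m x k, already on the device */
                  const float *hB,   /* k x n, pinned host memory    */
                  float *hC)         /* m x n, pinned host memory    */
{
    cudaStream_t stream[2];
    float *dB[2], *dC[2];
    float alpha = 1.0f, beta = 0.0f;

    for (int s = 0; s < 2; s++) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&dB[s], (size_t)k * CHUNK * sizeof(float));
        cudaMalloc(&dC[s], (size_t)m * CHUNK * sizeof(float));
    }

    for (int j = 0, s = 0; j < n; j += CHUNK, s ^= 1) {
        int cols = (j + CHUNK <= n) ? CHUNK : n - j;

        /* Stage the next column block of B; this copy can overlap the
         * GEMM still running in the other stream. */
        cudaMemcpyAsync(dB[s], hB + (size_t)j * k,
                        (size_t)k * cols * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);

        cublasSetStream(handle, stream[s]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, cols, k, &alpha, dA, m, dB[s], k, &beta, dC[s], m);

        /* Copy the finished block of C back, still in the same stream. */
        cudaMemcpyAsync(hC + (size_t)j * m, dC[s],
                        (size_t)m * cols * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < 2; s++) {
        cudaStreamSynchronize(stream[s]);
        cudaFree(dB[s]);
        cudaFree(dC[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```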
