In CPU code, you can use mmap to treat your disk as (practically) unlimited virtual RAM. It’s just slower.
Does CUDA have something similar? For example, if I have an AI model with 175 billion parameters (at least ~350 GB of VRAM needed) but only a 24 GB GPU, is there any way to run the model on the GPU just by changing cudaMalloc into something like cudaMallocMmap (which doesn’t exist)?
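For reference, the CPU-side pattern I have in mind is roughly this (the file name and sizes are just placeholders):

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    size_t bytes = 350ULL << 30;              // ~350 GB, far beyond physical RAM
    int fd = open("weights.bin", O_RDONLY);   // hypothetical file holding the parameters
    if (fd < 0) { perror("open"); return 1; }

    // The OS pages data in from disk on demand; no explicit reads needed.
    float *params = (float *)mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    if (params == MAP_FAILED) { perror("mmap"); return 1; }

    // Touching a page for the first time triggers a (slow) read from disk.
    printf("%f\n", params[1000000000ULL]);

    munmap(params, bytes);
    close(fd);
    return 0;
}
```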
In some situations, managed memory can allow a GPU allocation up to approximately the size of your host RAM (it will be less, of course, but if you had 1 TB of host system memory, it should be possible to create a managed allocation of e.g. 768 GB or larger). Something similar is possible with a pinned allocation, but a pinned allocation behaves differently from a managed allocation, and operating systems impose limits on the size of a pinned allocation that may be substantially lower than total system RAM.
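A minimal sketch of what that looks like (sizes are illustrative; whether the allocation succeeds, and how well it performs, depends on your host RAM, OS, and GPU generation):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n, float s) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= s;
}

int main() {
    // Ask for far more than the 24 GB of device memory. On Linux with a Pascal
    // or newer GPU, the driver can back the excess with host RAM and migrate
    // pages to the GPU on demand.
    size_t bytes = 100ULL << 30;              // 100 GB, illustrative
    float *data = nullptr;
    cudaError_t err = cudaMallocManaged(&data, bytes);
    if (err != cudaSuccess) {
        printf("cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }

    size_t n = bytes / sizeof(float);
    // Touching the data on the GPU triggers page migration over PCIe as needed.
    scale<<<1024, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```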
There isn’t any formal way to go beyond that (e.g. to use disk resources as the backing for an allocation). If you do some searching on forums, you will find people who have discussed how to use mmap with cudaHostRegister to do unusual things like let a disk file be the backing of an allocation, but these appear to be corner cases to me, and don’t represent anything like mainstream CUDA programming.
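For completeness, the pattern people have described looks roughly like this (the file name is hypothetical, cudaHostRegisterReadOnly requires a recent CUDA version, and whether cudaHostRegister accepts a file-backed mapping at all depends on the OS, driver, and filesystem; I would not rely on it):

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cuda_runtime.h>

int main() {
    size_t bytes = 1ULL << 30;                // 1 GB slice of a hypothetical weights file
    int fd = open("weights.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    void *h = mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    if (h == MAP_FAILED) { perror("mmap"); return 1; }

    // Try to pin and map the file-backed range so the GPU can address it.
    // On many configurations this will simply fail with an error.
    cudaError_t err = cudaHostRegister(h, bytes,
                                       cudaHostRegisterMapped | cudaHostRegisterReadOnly);
    if (err != cudaSuccess) {
        printf("cudaHostRegister: %s\n", cudaGetErrorString(err));
        return 1;
    }

    void *d = nullptr;
    cudaHostGetDevicePointer(&d, h, 0);       // device-visible pointer to the same pages
    // ... kernels could read through d; every access goes over PCIe ...

    cudaHostUnregister(h);
    munmap(h, bytes);
    close(fd);
    return 0;
}
```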
The operative words are a lot slower. My experience is that when one accidentally exhausts system memory and gets into a swapping regime, performance drops like a stone, e.g. by a factor of 100x. Not practical. Instead of relying on an automatic page-swapping mechanism, one would be better off handling the data exchange manually, i.e. with a carefully orchestrated out-of-core algorithm (see the sketch below).
While system memory provides a faster backing store than mass storage media, the interconnect between host and device is akin to a “straw” that would still be the major bottleneck in any automatic swapping regime. You have system memory with 200 GB/sec throughput (8 channels of DDR4-3200) and GPU memory with 2 TB/sec throughput (on an A100-PCIe), and in between a 25 GB/sec (per direction) PCIe gen 4 pipe.
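To illustrate what “carefully orchestrated” means: an out-of-core approach streams the data through the GPU in chunks, using pinned host memory and double-buffered asynchronous copies so that transfers overlap compute. In the sketch below, the chunk size and the process() kernel are placeholders. Even with perfect overlap, throughput is capped by the ~25 GB/sec link, so cycling ~350 GB of weights through the GPU would take on the order of 14 seconds per pass.

```cpp
#include <cuda_runtime.h>

__global__ void process(float *chunk, size_t n) { /* placeholder for real compute */ }

// Stream a large host-resident array through two small device buffers.
// Copies and kernels for chunk k are issued on stream k % 2, so compute on one
// stream overlaps transfers on the other, and stream ordering prevents a
// buffer from being overwritten before its kernel has finished.
void run_out_of_core(const float *h_data /* pinned via cudaMallocHost */,
                     size_t total, size_t chunk) {
    float *d_buf[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    for (size_t off = 0, k = 0; off < total; off += chunk, ++k) {
        int b = k & 1;                        // ping-pong between the two buffers
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(d_buf[b], h_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<256, 256, 0, s[b]>>>(d_buf[b], n);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(s[i]);
    }
}
```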
I am highly skeptical that the specific setup mentioned (a 175B-parameter model on a single 24 GB GPU) has any practical utility, regardless of how the data exchange is organized. Correct me if I am wrong, but it is my understanding that models with 175 billion parameters (e.g. BLOOM, GPT-3) are presently at the upper limit of AI model size outside of specialized platforms such as Cerebras. Large models on general-purpose hardware typically run on the highest-end systems and parallelize across hundreds of GPUs.