I have 3 GB of VRAM and I can only allocate an array of 520^3 vector4 elements; above that, the application just ignores the malloc.
I’ve tried cudaMallocManaged() and cudaMallocHost(), but they are both limited to the (shared) VRAM size as well, and they slow performance by about 70%.
Assuming this is a Windows 10 system using the default WDDM driver, the maximum allocation via cudaMalloc will be about 81% of the GPU memory. With 3 GB of GPU memory, this is about 2,600,000,000 bytes.
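For reference, assuming your vector4 is a 16-byte float4, 520^3 elements works out to 140,608,000 × 16 bytes = 2,249,728,000 bytes, already within sight of that cap, so the limit you are hitting is consistent with WDDM behavior.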
For GPUs with more memory, this percentage can be a tad higher. For example, on my Quadro RTX 4000 with 8 GB of GPU memory, the maximum allocation size via cudaMalloc is 7,060,320,000 bytes, or 82.2% of total physical memory.
If the programmer requests more memory than can be allocated, cudaMalloc() informs the programmer of that fact via the returned status code (cudaErrorMemoryAllocation). That’s different from silently ignoring the request and is the best it can do.
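A minimal sketch of what checking that status code looks like; the element count and float4 element type are taken from your post, the rest is standard CUDA runtime API:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 520ULL * 520ULL * 520ULL;  // 140,608,000 elements
    const size_t bytes = n * sizeof(float4);    // ~2.25 GB

    float4 *d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess) {
        // An oversized request fails with cudaErrorMemoryAllocation;
        // it is reported here, not silently ignored.
        fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                bytes, cudaGetErrorString(err));
        return 1;
    }
    printf("allocated %zu bytes\n", bytes);
    cudaFree(d_data);
    return 0;
}
```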
NVIDIA offers GPUs with all kinds of memory sizes, up to 80 GB, so you might want to use a different one. I read that cloud services offer large GPU instances at reasonable prices, or you could look into buying more capable hardware, possibly previously owned if your budget is tight.
You might also want to ponder whether your data could be stored more efficiently, e.g. by using half precision instead of single precision.
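As a hedged sketch of what that might look like: the half4 struct and the scale kernel below are made up for illustration (CUDA itself only provides __half and __half2 in cuda_fp16.h), with storage in fp16 and arithmetic still done in fp32:

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical 8-byte element: four fp16 components instead of a
// 16-byte float4, halving the memory footprint of the array.
struct half4 { __half x, y, z, w; };

__device__ half4 pack(float4 v) {
    return { __float2half(v.x), __float2half(v.y),
             __float2half(v.z), __float2half(v.w) };
}

__device__ float4 unpack(half4 v) {
    return make_float4(__half2float(v.x), __half2float(v.y),
                       __half2float(v.z), __half2float(v.w));
}

// Illustrative kernel: load fp16 storage, compute in fp32, store fp16.
// Launch e.g. as: scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
__global__ void scale(half4 *data, float s, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = unpack(data[i]);
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        data[i] = pack(v);
    }
}
```

At 8 bytes per element, a 520^3 array shrinks to about 1.12 GB, at the cost of fp16’s reduced precision and range.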
I cannot make forward-looking statements, but I’m pretty familiar with the GPUs produced by NVIDIA since about 1998. We have never produced a discrete GPU with upgradeable memory. It’s a possible data point to consider when speculating about the future.