Running out of global memory

On my GPU architecture, the total global memory is 16,945,512,448 bytes. I am looking to allocate 25 arrays of type double, each of length 141918208, in global memory.

This exceeds the global memory capacity, and the error message I receive when I run cuda-memcheck confirms it:
“Program hit cudaErrorMemoryAllocation (error 2) due to “out of memory” on CUDA API call to cudaMalloc.”

Is there a way to get around this?

If your GPU compute capability is 6.0 or higher (you can discover this with the deviceQuery sample code) and you are on Linux, you can oversubscribe your GPU memory if you use a managed allocator such as cudaMallocManaged. There will be limits to this. The overall size you are indicating is ~28GB, so if you also have limited host memory, the oversubscription may not work (out of memory). Furthermore, oversubscription may have substantial performance penalties depending on how the arrays are used, and it carries the general performance concerns associated with managed memory use.
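
As a minimal sketch (using the array count and per-array length from your post, with simplified error handling), the managed allocation might look like this:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch only: allocate 25 managed arrays whose combined size (~28GB) exceeds
// the 16GB of device memory. On a cc 6.0+ GPU under Linux, managed memory can
// be oversubscribed; pages migrate between host and device on demand.
int main()
{
    const int    numArrays = 25;
    const size_t numElems  = 141918208;   // per-array length from the question
    double      *arr[numArrays];

    for (int i = 0; i < numArrays; ++i) {
        cudaError_t err = cudaMallocManaged(&arr[i], numElems * sizeof(double));
        if (err != cudaSuccess) {
            printf("allocation %d failed: %s\n", i, cudaGetErrorString(err));
            return 1;
        }
    }
    // ... launch kernels that use arr[0..24] here ...
    for (int i = 0; i < numArrays; ++i) cudaFree(arr[i]);
    return 0;
}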

Alternative approaches may also involve breaking your data set into pieces and moving those pieces to and from GPU memory while the computation proceeds. This is a general concept known as overlap of copy and compute, and there are numerous online resources to learn about it, such as here.
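
As a rough illustration of that idea (the kernel, chunk size, and function names here are placeholders, not taken from your code, and the host buffer is assumed to be pinned via cudaMallocHost so the asynchronous copies can actually overlap with compute):

#include <algorithm>
#include <cuda_runtime.h>

__global__ void process(double *d, size_t n)    // placeholder kernel
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;
}

// Sketch: stream the data through two fixed-size device buffers, chunk by
// chunk, using two streams so the copies for one chunk can overlap the
// compute of the previous one. h_data should be pinned (cudaMallocHost).
void processInChunks(double *h_data, size_t totalElems, size_t chunkElems)
{
    double      *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], chunkElems * sizeof(double));
        cudaStreamCreate(&stream[i]);
    }
    for (size_t off = 0, c = 0; off < totalElems; off += chunkElems, ++c) {
        int    s = c % 2;
        size_t n = std::min(chunkElems, totalElems - off);
        cudaMemcpyAsync(d_buf[s], h_data + off, n * sizeof(double),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_data + off, d_buf[s], n * sizeof(double),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
}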

What model of GPU is this? Are you running with ECC enabled? What operating system are you using? If on Windows, are you running the GPU with WDDM or TCC driver?

Generally speaking, it is impossible to use all of the GPU memory for applications.

On older GPUs that support ECC, 6.25% of the memory capacity is needed to store the additional ECC data. It is possible to turn off ECC support with nvidia-smi; this requires a reboot to take effect.

The CUDA runtime needs some GPU memory for its own purposes. I have not looked recently at how much that is; from memory, it is around 5%.

Under Windows with the default WDDM drivers, the operating system reserves a substantial amount of additional GPU memory for its purposes, about 15% if I recall correctly.
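
Regardless of platform, you can see how much of the device memory is actually available to your application by querying it at the start of your program; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

// Print total device memory vs. what is currently free, i.e. after the CUDA
// runtime (and, under Windows/WDDM, the OS) has taken its share.
int main()
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("total: %zu bytes, free: %zu bytes (%.1f%% available)\n",
           totalB, freeB, 100.0 * freeB / totalB);
    return 0;
}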

Thank you for your responses. I ran the deviceQuery and here are the results:

Device name: Tesla V100-SXM2-16GB
Compute capability: 7.0
Total global memory: 16,945,512,448 bytes
Total constant memory: 65536 bytes
Max grid size, dim(0): 2147483647
Max grid size, dim(1): 65535
Max grid size, dim(2): 65535
Max threads per block: 1024
Max block size, dim(0): 1024
Max block size, dim(1): 1024
Max block size, dim(2): 64
Shared memory per block: 49152 bytes
Registers per block: 65536
Clock frequency: 1530000 kHz
Asynchronous engines: 6
Multiprocessors on device: 80

The OS of my host computer is Linux. I haven’t enabled ECC. Where can I find more resources on enabling ECC?

Regarding cudaMallocManaged, where can I find the limits on the size in GB? On my host computer, I have close to 1 TB of free storage left.

I probably misunderstood the original question. On re-reading it, it seems you are trying to allocate 25 arrays of equal size, each of which has 141918208 elements, for a total of 25 x 141918208 x 8 = 28,383,641,600 bytes, or roughly 26.4 GiB, which obviously exceeds the size of the physical memory on the card.

If so, you would want to follow Robert Crovella’s advice.

From the nvidia-smi man page / documentation:

-e, --ecc-config=CONFIG
Set the ECC mode for the target GPUs. See the (GPU ATTRIBUTES) section
for a description of ECC mode. Requires root. Will impact all GPUs
unless a single GPU is specified using the -i argument. This setting
takes effect after the next reboot and is persistent.

If you just want to check whether ECC is currently enabled, run nvidia-smi -q and look in the section Ecc Mode.

There are no published limits, formulas, or arithmetic. My suggestion would be to give it a try. I’m quite confident you can oversubscribe to ~30GB on a 16GB V100 in a system that has ~1TB of system memory (not “storage”; we are talking about system RAM, not disk space).

Thank you. Is there a way to determine how much RAM a job used once it has completed on a GPU node?

After a process or job completes, I know of no way to retrieve statistics about it unless you had some sort of monitoring utility running. That is a general statement (for me), not specific to GPU/CUDA, but including GPU/CUDA.

If you want to monitor job behavior, and are willing to install tools, so that you can get historical data, then I believe there are many tools to do this. In terms of a tool that is GPU-enabled to do this, you could look at DCGM.

For a lightweight option, you could run a separate process on the same node that loops nvidia-smi while the job is running, and inspect the output. If you have spiky allocation behavior (cudaMallocXXX followed shortly by cudaFree) happening in less than a second, you may not capture the peak/max that way. I’m not sure DCGM will either.

If you have complete source code access, you could instrument your code (e.g. wrappers around cudaMallocXXX-cudaFree) to capture any statistics you want.
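
For example, a simple (illustrative, not exhaustive or thread-safe) pair of wrappers that tracks current and peak bytes allocated through cudaMalloc might look like this:

#include <cstdio>
#include <map>
#include <cuda_runtime.h>

// Illustrative only: route allocations through these wrappers instead of
// calling cudaMalloc/cudaFree directly, and they will track current and
// peak device memory usage for you. Not thread-safe as written.
static std::map<void *, size_t> g_sizes;
static size_t g_current = 0, g_peak = 0;

cudaError_t trackedMalloc(void **p, size_t bytes)
{
    cudaError_t err = cudaMalloc(p, bytes);
    if (err == cudaSuccess) {
        g_sizes[*p] = bytes;
        g_current += bytes;
        if (g_current > g_peak) g_peak = g_current;
    }
    return err;
}

cudaError_t trackedFree(void *p)
{
    auto it = g_sizes.find(p);
    if (it != g_sizes.end()) { g_current -= it->second; g_sizes.erase(it); }
    return cudaFree(p);
}

void reportPeak() { printf("peak device allocation: %zu bytes\n", g_peak); }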

Also note that if you are using a managed allocator, perhaps also with oversubscription, I wouldn’t really expect the output from nvidia-smi or any other external GPU monitoring utility to tell you much that is useful from the memory reporting. There isn’t a strong connection between the reported physical memory usage and whatever you are doing with cudaMallocManaged in this case. For example, if you oversubscribe a 16GB GPU to 24GB, I wouldn’t expect nvidia-smi to report any number like 24GB.

Is the total global memory (retrieved after running deviceQuery) a subset of the total RAM on the node?

Device name: Tesla V100-SXM2-16GB
Total global memory: 16,945,512,448 bytes

A GPU has its own (device) memory. That is the 16GB being reported there. If you have multiple of these GPUs in your node, each GPU will have its own 16GB.

The “node” will usually also have some system memory, also called CPU memory or host memory. This is separate from anything reported by deviceQuery.
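
For instance, a small program can report both pools side by side (the host figure below uses Linux sysconf, since you mentioned your node runs Linux):

#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

// Print the device (GPU) memory of device 0 next to the host (system) RAM,
// to make clear that these are two separate pools of memory.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t hostBytes = (size_t)sysconf(_SC_PHYS_PAGES) * (size_t)sysconf(_SC_PAGE_SIZE);
    printf("GPU %s device memory: %zu bytes\n", prop.name, prop.totalGlobalMem);
    printf("Host system memory:   %zu bytes\n", hostBytes);
    return 0;
}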