DGX-1 P100(16GB*8) VRAM Question: 16GB or 128GB?

Hi everyone,

I need some help with a VRAM question about my DGX-1 P100 server. The server has 8 P100 GPUs, each with 16GB of VRAM. Does the GPU memory combine through NVLink’s shared memory mechanism during program execution to provide an effective 128GB of VRAM (16GB * 8), or is each GPU limited to a maximum of 16GB? I’m asking because after updating my machine to the latest version, running models that require more than 16GB results in an ‘out of memory’ error.

The result of command:
$sudo nvsm show alerts

  • GPU 4’s NvLink link 3 is currently down. Run a field diagnostic on the GPU.
  • GPU 7’s NvLink link 1 is currently down. Run a field diagnostic on the GPU.

$nvidia-smi nvlink -s

Hi @asdewq45445 !

The answer to “…does GPU memory combine through NVLink’s shared memory mechanism during program execution to provide an effective 128GB of VRAM (16GB * 8), or is each GPU limited to a maximum of 16GB?” is the latter. Each GPU has its own local memory, so in general allocations need to fit within that size on each GPU. This can be frustrating as models and data sizes grow, and it’s one of the reasons you see more memory on newer NVIDIA GPUs.
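To make the per-GPU limit concrete, here’s a rough back-of-the-envelope sketch (pure Python, illustrative numbers that are assumptions, not from your setup) of why a model whose weights exceed a single P100’s 16GB will OOM even with 8 GPUs in the box:

```python
# Rough sizing sketch: do a model's raw weights fit in ONE GPU's VRAM?
# Activations, gradients, and optimizer state add overhead on top of
# this, so real headroom is smaller than this estimate suggests.

def fits_on_one_gpu(num_params, bytes_per_param=4, vram_bytes=16 * 1024**3):
    """True if the weights alone fit in a single 16GB GPU (fp32 by default)."""
    return num_params * bytes_per_param <= vram_bytes

# ~3B fp32 params need ~12 GB for weights alone -> barely fits;
# ~5B fp32 params need ~20 GB -> OOM on any single 16GB P100,
# no matter how many GPUs share the NVLink fabric.
print(fits_on_one_gpu(3_000_000_000))
print(fits_on_one_gpu(5_000_000_000))
```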

There are some ways that GPUs can access the memory of other GPUs (or system memory), but that doesn’t really get you the “I have 8 * 16GB of GPU RAM” behavior you’re looking for at the PyTorch/etc. level.
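What people usually do instead is split the model itself across the GPUs, so no single device has to hold everything. As a minimal, hypothetical sketch (pure Python, just the partitioning arithmetic; in PyTorch you’d then move each chunk of layers to its own device with `.to('cuda:N')`):

```python
# Naive model/pipeline parallelism sketch: assign contiguous runs of
# layers to each GPU so each device only stores its own slice of the
# model. This shows the partitioning logic only, not the data movement.

def partition_layers(num_layers, num_gpus=8):
    """Split layer indices 0..num_layers-1 into num_gpus contiguous chunks."""
    base, extra = divmod(num_layers, num_gpus)
    chunks, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # spread the remainder
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# e.g. a 24-layer model over 8 GPUs -> 3 layers per device
print(partition_layers(24))
```

Each GPU then only needs VRAM for its own layers plus the activations passing between stages, which is how models larger than 16GB get run on boxes like the DGX-1.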

The two GPUs with down NVLink links are a separate problem, and are symptomatic of a hardware issue.

ScottE