DGX-1 P100(16GB*8) VRAM Question: 16GB or 128GB?

Hi everyone,

I need some help with a VRAM question about my DGX-1 P100 server. The server has 8 P100 GPUs, each with 16GB of VRAM. Does the GPU memory combine through NVLink’s shared memory mechanism during program execution to provide an effective 128GB of VRAM (16GB * 8), or is each GPU limited to a maximum of 16GB? I’m asking because after updating my machine to the latest version, running models that require more than 16GB results in an ‘out of memory’ error.

The result of command:
$sudo nvsm show alerts

  • GPU 4’s NvLink link 3 is currently down. Run a field diagnostic on the GPU.
  • GPU 7’s NvLink link 1 is currently down. Run a field diagnostic on the GPU.

$nvidia-smi nvlink -s

Hi @asdewq45445 !

The answer to “…does GPU memory combine through NVLink’s shared memory mechanism during program execution to provide an effective 128GB of VRAM (16GB * 8), or is each GPU limited to a maximum of 16GB?” is the latter. Each GPU has its own local memory, so in general allocations need to fit within that size on each GPU. This can be frustrating as models and data sizes grow, and it’s one of the reasons you see more memory on newer NVIDIA GPUs.
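To make the per-GPU limit concrete, here’s a rough back-of-the-envelope sketch (pure Python, illustrative numbers that are assumptions, not from your setup) of why a model whose weights exceed a single P100’s 16GB will OOM even with 8 GPUs in the box:

```python
# Rough sizing sketch: do a model's raw weights fit in ONE GPU's VRAM?
# Activations, gradients, and optimizer state add overhead on top of
# this, so real headroom is smaller than this estimate suggests.

def fits_on_one_gpu(num_params, bytes_per_param=4, vram_bytes=16 * 1024**3):
    """True if the weights alone fit in a single 16GB GPU (fp32 by default)."""
    return num_params * bytes_per_param <= vram_bytes

# ~3B fp32 params need ~12 GB for weights alone -> barely fits;
# ~5B fp32 params need ~20 GB -> OOM on any single 16GB P100,
# no matter how many GPUs share the NVLink fabric.
print(fits_on_one_gpu(3_000_000_000))
print(fits_on_one_gpu(5_000_000_000))
```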

There are some ways that GPUs can access the memory of other GPUs (or system memory), but that doesn’t really get you the “I have 8 * 16GB of GPU RAM” behavior you’re looking for at the PyTorch/etc. level.
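What people usually do instead is split the model itself across the GPUs, so no single device has to hold everything. As a minimal, hypothetical sketch (pure Python, just the partitioning arithmetic; in PyTorch you’d then move each chunk of layers to its own device with `.to('cuda:N')`):

```python
# Naive model/pipeline parallelism sketch: assign contiguous runs of
# layers to each GPU so each device only stores its own slice of the
# model. This shows the partitioning logic only, not the data movement.

def partition_layers(num_layers, num_gpus=8):
    """Split layer indices 0..num_layers-1 into num_gpus contiguous chunks."""
    base, extra = divmod(num_layers, num_gpus)
    chunks, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # spread the remainder
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# e.g. a 24-layer model over 8 GPUs -> 3 layers per device
print(partition_layers(24))
```

Each GPU then only needs VRAM for its own layers plus the activations passing between stages, which is how models larger than 16GB get run on boxes like the DGX-1.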

The two GPUs with down NVLink links are a separate problem, and are symptomatic of a hardware issue.

ScottE