Data being sent to both GPUs despite only selecting one

Specs (I have the full spec sheet we used to purchase the machine if needed):

  • OS: Red Hat Enterprise Linux version 8.9 (Ootpa)
  • GPU: 2x NVIDIA RTX A6000
  • NVIDIA SMI info: NVIDIA-SMI 545.23.08, Driver Version: 545.23.08, CUDA Version: 12.3

Hi, we recently ordered and received a new machine with 2 NVIDIA RTX A6000 cards, which gives us 96 GB of VRAM to work with. However, we noticed a strange issue with how memory is being allocated on the GPUs.

I was working on some experiments in PyTorch and bumped my batch size up high enough to use about 32 of the 48 GB on GPU 0. When I did, I saw that for some reason both GPUs were allocating the same amount of memory. To make sure it was not a visual glitch in nvidia-smi, I ran the same model at the same time but on GPU 1, and it ran out of memory. With 96 GB of VRAM in total I should have been able to run the model on both GPUs at the same time, but it looks like the data is for some reason being copied to both GPUs.

I was not sure whether this was a PyTorch issue, so I put together a small script to test:

import numpy as np
from numba import cuda

cuda.select_device(0)          # select GPU 0 (or 1) as the active device
data = np.ones(1000000000)     # ~8 GB of float64 on the host
d_data = cuda.to_device(data)  # copy the array to the selected GPU

while True:                    # keep the process (and the allocation) alive until the user quits
    a = 1
All this code does is send data to the GPU specified in the cuda.select_device() line (device 0 or 1 in our case) and then hang until the user quits.
After doing so, I was able to capture screenshots of each GPU's memory allocation using nvtop:

The yellow line in each image represents the memory allocation for that GPU. The first image shows the run with cuda.select_device(0) and the second the run with cuda.select_device(1). You can see that in each case the data is being sent to both GPUs, even though the code only selects one. I have also tried using the CUDA_VISIBLE_DEVICES environment variable. Setting it to 0 causes a segmentation fault with no further output. Setting it to 1 works but still puts the data onto both GPUs. Setting it to anything else does not work and reports that no CUDA devices are available.

We have never seen anything like this before, and none of us have worked with Red Hat Linux either, so we were wondering whether this is an OS issue or possibly a hardware issue. Either way, this bug effectively limits us to 48 of the 96 GB of VRAM in the machine because the data is copied to both GPUs. If anyone has any insight on how to go about diagnosing and fixing this issue, it would be much appreciated. Thank you!

Hello @marcbaltes98 and welcome to the NVIDIA developer forums.

I am sorry to hear that you have this kind of issue and apologies for the late reply.

I recommend we move this post to the CUDA forums, but first of all I would like to ask you to try out a CUDA-only sample app that uses the actual CUDA interface, cudaSetDevice(), just to rule out any issues with Numba on Red Hat multi-GPU installations. I am not familiar enough with CUDA to recommend anything specific, but you can check out the CUDA samples on GitHub.
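
As a very rough starting point (just a sketch built on the standard runtime API calls, not an official sample; the 8 GB allocation size is arbitrary), something along these lines should show whether a plain cudaSetDevice() plus cudaMalloc() also ends up touching both GPUs:

// single_gpu_alloc.cu - allocate a large buffer on one GPU only, then wait,
// so nvtop / nvidia-smi can show whether the other GPU also fills up.
// Build with: nvcc -o single_gpu_alloc single_gpu_alloc.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    // Device index to test: defaults to 0, pass 1 on the command line for the second GPU.
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Visible CUDA devices: %d\n", count);

    err = cudaSetDevice(device);
    if (err != cudaSuccess) {
        std::printf("cudaSetDevice(%d) failed: %s\n", device, cudaGetErrorString(err));
        return 1;
    }

    // Allocate roughly 8 GB on the selected device (same order of magnitude as the Numba test).
    size_t bytes = 8ULL * 1024 * 1024 * 1024;
    void* d_ptr = nullptr;
    err = cudaMalloc(&d_ptr, bytes);
    if (err != cudaSuccess) {
        std::printf("cudaMalloc of %zu bytes failed: %s\n", bytes, cudaGetErrorString(err));
        return 1;
    }
    std::printf("Allocated %zu bytes on device %d. Check nvtop, then press Enter to exit.\n",
                bytes, device);

    // Keep the allocation alive until the user presses Enter, so memory usage can be inspected.
    std::getchar();

    cudaFree(d_ptr);
    return 0;
}

Running it with different CUDA_VISIBLE_DEVICES settings would also show whether the environment variable behaves as expected. If this plain CUDA test reproduces the duplicated allocation, the problem is below Numba and PyTorch; if it does not, that points at the Python stack instead.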

Then the best forum for this would be CUDA Programming and Performance - NVIDIA Developer Forums

Thanks!