How to disable one of the GPUs

I found that one of the Tesla GPUs has gone bad, and I would like to disable it:

nvidia-smi

Wed Mar 25 10:21:34 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050          Off | 0000:04:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050          Off | 0000:05:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050          Off | 0000:08:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M2050          Off | 0000:09:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

I got the following errors in the system log:
kernel: NVRM: Xid (PCI:0000:04:00): 58, Edc 00000004
kernel: NVRM: Xid (PCI:0000:04:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU (00 04 00).
kernel: NVRM: Xid (PCI:0000:04:00): 45, Ch 00000001, engmsk 00000100

Is there a way I can disable GPU #0 (Bus-Id: 0000:04:00.0) from the OS (RHEL 5)?
Thank you!

And if it is possible to disable the GPUs, do I have to disable a pair of them, say 2 of 4? Thanks again!

[url]http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/[/url]

Thanks for the quick reply. However, is there a way to do it at the OS level? We use third-party software, so we can't control the code. Please advise. Thanks again!

Not sure what you mean by “OS level”. CUDA_VISIBLE_DEVICES is an environment variable that you can set from the console, prior to starting your app.
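
For example, to hide GPU 0 (the failing device at 0000:04:00.0) from a CUDA application launched from a bash shell, something like the sketch below should work. The application name is a placeholder, and note that CUDA's default device ordering may not match nvidia-smi's PCI ordering, although with four identical M2050s it usually does:

# Hide GPU 0 from the application; the remaining devices are renumbered 0,1,2 inside the app.
export CUDA_VISIBLE_DEVICES=1,2,3
./your_cuda_app        # placeholder for the third-party application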

I see. However, I wonder if we can disable the GPU so that nvidia-smi would only show 3 out of 4. Because we use Grid Engine, we need to set "gpu=3" instead of "gpu=4". Thanks again!

nvidia-smi’s purpose is precisely to provide low-level access and control, so I do not know that one can hide much of anything from it, nor does such an approach seem to make much sense to me.

I do not know, and thus have no experience with, “Grid Engine” and how it interacts with nvidia-smi. Is Grid Engine an NVIDIA product? If not, you may want to seek assistance from the vendor of Grid Engine.

There seems to be a new feature in CUDA 7.0 that may help with your scenario (I have no experience with it). The release notes describe it as follows:

[url]http://docs.nvidia.com/cuda/cuda-toolkit-release-notes[/url]
“Instrumented NVML (NVIDIA Management Library) and the CUDA driver to ignore GPUs that have been made inaccessible via cgroups (control groups). This enables schedulers that rely on cgroups to enforce device access restrictions for their jobs. Job schedulers wanting to use cgroups for device restriction need CUDA and NVML to handle those restrictions in a graceful way.”
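
As an illustration only (a sketch, not tested, and the cgroup mount point, group name, and device numbers are assumptions): with the cgroup v1 devices controller, a job's cgroup could be denied access to /dev/nvidia0, which is typically character device major 195, minor 0 (the minor number matching the GPU index):

# Hypothetical sketch of denying GPU 0 to a cgroup via the v1 devices controller.
mkdir -p /sys/fs/cgroup/devices/gpu_jobs
echo 'c 195:0 rwm' > /sys/fs/cgroup/devices/gpu_jobs/devices.deny   # deny read/write/mknod on /dev/nvidia0
echo $$ > /sys/fs/cgroup/devices/gpu_jobs/tasks                     # move the current shell into the cgroup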

I see; then I may have to remove the bad card from the box or disable it in the BIOS. Thanks again!

See my update regarding cgroup support added in CUDA 7.0. Maybe that can help?

I have successfully made GPUs invisible to nvidia-smi in the past by “removing” them from the PCIe bus via:

echo 1 > /sys/bus/pci/devices/0000:XX:00.0/remove

where XX is the bus portion of the Bus-Id shown in nvidia-smi (04 for the failing GPU in this thread). If that works, then you could drop it into rc.local or its systemd equivalent to remove the device after boot, as sketched below.
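
For example, an rc.local entry for the failing GPU in this thread might look like the following (a sketch, assuming the bad GPU stays at PCI address 0000:04:00.0 across reboots):

# Sketch for /etc/rc.local: detach the bad GPU from the PCIe bus at boot.
if [ -e /sys/bus/pci/devices/0000:04:00.0/remove ]; then
    echo 1 > /sys/bus/pci/devices/0000:04:00.0/remove   # GPU disappears from nvidia-smi afterwards
fi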