How to disable one of the GPUs

I found that one of the Tesla GPUs has gone bad, and I would like to disable it:

nvidia-smi

Wed Mar 25 10:21:34 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050          Off | 0000:04:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050          Off | 0000:05:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050          Off | 0000:08:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M2050          Off | 0000:09:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     6MiB /  2687MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

I got the following errors in the system log:
kernel: NVRM: Xid (PCI:0000:04:00): 58, Edc 00000004
kernel: NVRM: Xid (PCI:0000:04:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU (00 04 00).
kernel: NVRM: Xid (PCI:0000:04:00): 45, Ch 00000001, engmsk 00000100

Is there a way I can disable GPU #0 (Bus-Id: 0000:04:00.0) from the OS (RHEL 5)?
Thank you!

And if it is possible to disable the GPUs, do I have to disable a pair of them, say 2 of 4? Thanks again!

[url]http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/[/url]

Thanks for the quick reply. However, is there a way to do it at the OS level? We use third-party software, so we can't control the code. Please advise. Thanks again!

Not sure what you mean by “OS level”. CUDA_VISIBLE_DEVICES is an environment variable that you can set from the console, prior to starting your app.
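
For example, to hide GPU 0 (the failing device at 0000:04:00.0) from a CUDA application launched from a bash shell, something like the sketch below should work. The application name is a placeholder, and note that CUDA's default device ordering may not match nvidia-smi's PCI ordering, although with four identical M2050s it usually does:

# Hide GPU 0 from the application; the remaining devices are renumbered 0,1,2 inside the app.
export CUDA_VISIBLE_DEVICES=1,2,3
./your_cuda_app        # placeholder for the third-party application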

I see. However, I wonder if we can disable the GPU so that nvidia-smi would only show 3 out of 4. Because we use Grid Engine, we need to set "gpu=3" instead of "gpu=4". Thanks again!

nvidia-smi’s purpose is precisely to provide low-level access and control, so I do not know that one can hide much of anything from it, nor does such an approach seem to make much sense to me.

I do not know, and thus have no experience with, “Grid Engine” and how it interacts with nvidia-smi. Is Grid Engine an NVIDIA product? If not, you may want to seek assistance from the vendor of Grid Engine.

There seems to be a new feature in CUDA 7.0 that may help with your scenario (I have no experience with it). The release notes describe it as follows:

[url]http://docs.nvidia.com/cuda/cuda-toolkit-release-notes[/url]
“Instrumented NVML (NVIDIA Management Library) and the CUDA driver to ignore GPUs that have been made inaccessible via cgroups (control groups). This enables schedulers that rely on cgroups to enforce device access restrictions for their jobs. Job schedulers wanting to use cgroups for device restriction need CUDA and NVML to handle those restrictions in a graceful way.”
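
As an illustration only (a sketch, not tested, and the cgroup mount point, group name, and device numbers are assumptions): with the cgroup v1 devices controller, a job's cgroup could be denied access to /dev/nvidia0, which is typically character device major 195, minor 0 (the minor number matching the GPU index):

# Hypothetical sketch of denying GPU 0 to a cgroup via the v1 devices controller.
mkdir -p /sys/fs/cgroup/devices/gpu_jobs
echo 'c 195:0 rwm' > /sys/fs/cgroup/devices/gpu_jobs/devices.deny   # deny read/write/mknod on /dev/nvidia0
echo $$ > /sys/fs/cgroup/devices/gpu_jobs/tasks                     # move the current shell into the cgroup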

I see; then I may have to remove the bad card from the box or disable it in the BIOS. Thanks again!

See my update regarding cgroup support added in CUDA 7.0. Maybe that can help?

I have successfully made GPUs invisible to nvidia-smi in the past by “removing” them from the PCIe bus via:

echo 1 > /sys/bus/pci/devices/0000:XX:00.0/remove

where XX is the bus portion of the Bus-Id shown in nvidia-smi (04 for the failing GPU in this thread). If that works, then you could drop it into rc.local or its systemd equivalent to remove the device after boot, as sketched below.
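
For example, an rc.local entry for the failing GPU in this thread might look like the following (a sketch, assuming the bad GPU stays at PCI address 0000:04:00.0 across reboots):

# Sketch for /etc/rc.local: detach the bad GPU from the PCIe bus at boot.
if [ -e /sys/bus/pci/devices/0000:04:00.0/remove ]; then
    echo 1 > /sys/bus/pci/devices/0000:04:00.0/remove   # GPU disappears from nvidia-smi afterwards
fi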