RESOLVED!!! | GPU missing from nvidia-smi but seen in lspci

I'm running Ubuntu 18.04 with 8x Tesla V100 SXM2 32GB. I previously had all 8 GPUs up and running; all 8 were visible in both nvidia-smi and lspci. As of today I only see the two GPUs in slots 5 and 7 of my system. I've tried all the troubleshooting steps I know of and believe the problem to be software in nature. This is happening on two of my 6 nodes, all homogeneous.

root@znode48:~# uname -a
Linux znode48 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

root@znode48:~# lspci | grep -i nvidia
15:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
16:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3b:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
8a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b2:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b3:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)

root@znode48:~# nvidia-smi
Mon Mar 2 16:40:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

root@znode48:~# dmesg | grep -i nvidia
[ 5.840912] nvidia: module license 'NVIDIA' taints kernel.
[ 6.060696] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[ 6.098954] nvidia 0000:15:00.0: enabling device (0140 -> 0142)
[ 6.123568] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.174247] nvidia: probe of 0000:15:00.0 failed with error -1
[ 6.194921] nvidia 0000:16:00.0: enabling device (0140 -> 0142)
[ 6.205164] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.235924] nvidia: probe of 0000:16:00.0 failed with error -1
[ 6.246095] nvidia 0000:3a:00.0: enabling device (0140 -> 0142)
[ 6.317316] nvidia 0000:3b:00.0: enabling device (0140 -> 0142)
[ 6.369590] nvidia 0000:89:00.0: enabling device (0140 -> 0142)
[ 6.369662] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369675] nvidia: probe of 0000:89:00.0 failed with error -1
[ 6.369692] nvidia 0000:8a:00.0: enabling device (0140 -> 0142)
[ 6.369731] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369739] nvidia: probe of 0000:8a:00.0 failed with error -1
[ 6.369760] nvidia 0000:b2:00.0: enabling device (0140 -> 0142)
[ 6.369793] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369800] nvidia: probe of 0000:b2:00.0 failed with error -1
[ 6.369812] nvidia 0000:b3:00.0: enabling device (0140 -> 0142)
[ 6.369843] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369849] nvidia: probe of 0000:b3:00.0 failed with error -1
[ 6.369870] NVRM: The NVIDIA probe routine failed for 6 device(s).
[ 6.369871] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.33.01 Wed Nov 13 00:00:22 UTC 2019
[ 6.377600] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.33.01 Tue Nov 12 23:43:11 UTC 2019
[ 6.516842] [drm] [nvidia-drm] [GPU ID 0x00003a00] Loading driver
[ 6.516861] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3a:00.0 on minor 1
[ 6.516964] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[ 6.516981] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3b:00.0 on minor 2
[ 10.676097] nvidia-uvm: Loaded the UVM driver, major device number 510.
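The "PCI I/O region assigned to your NVIDIA device is invalid" lines point at BAR assignment, which you can inspect directly from sysfs. A minimal sketch (the bus ID 15:00.0 and the resource-file format are from the kernel's standard sysfs PCI interface; the `check_bar` helper and the sample address values are mine for illustration): each line of `/sys/bus/pci/devices/<bdf>/resource` is "start end flags", and a BAR the firmware failed to assign reads as all zeros.

```shell
# On the affected node, dump the raw BAR assignments for a failing GPU:
#   cat /sys/bus/pci/devices/0000:15:00.0/resource
# Each line is "start end flags"; a helper to classify one such line:
check_bar() {
  start=$(echo "$1" | awk '{print $1}')
  end=$(echo "$1" | awk '{print $2}')
  if [ "$start" = "0x0000000000000000" ] && [ "$end" = "0x0000000000000000" ]; then
    echo "unassigned"
  else
    echo "assigned"
  fi
}

# Sample lines in the format the kernel exposes (addresses are made up):
check_bar "0x00000000c6000000 0x00000000c6ffffff 0x0000000000040200"   # assigned
check_bar "0x0000000000000000 0x0000000000000000 0x0000000000000000"   # unassigned
```

On a healthy node every GPU BAR line should be non-zero; comparing a failing node against a working one helps narrow the problem to firmware/NVRAM rather than the driver.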

root@znode48:~# lsmod | grep -i nvidia
nvidia_uvm 929792 0
nvidia_drm 45056 0
nvidia_modeset 1110016 1 nvidia_drm
nvidia 19890176 29 nvidia_uvm,nvidia_modeset
drm_kms_helper 172032 2 mgag200,nvidia_drm
drm 401408 5 drm_kms_helper,mgag200,nvidia_drm,ttm
ipmi_msghandler 53248 4 ipmi_devintf,ipmi_si,nvidia,ipmi_ssif

What (hardware or software) changes were made to the system since then? Undo those changes and see whether everything works as intended after that.

Is this a fully configured system you acquired either from NVIDIA or from an NVIDIA-approved system integrator (https://www.nvidia.com/en-us/data-center/tesla/tesla-qualified-servers-catalog/)? Or did you assemble or upgrade the system yourself?

I assume Tesla "V100 SXM2 32GB" is what NVIDIA calls "NVIDIA V100 FOR NVLINK", one of which draws 300 W? If you suspect general flakiness, the first thing I would check is the power supply. What does the rest of the platform look like, in terms of CPUs and system memory? What does the power supply solution look like? Presumably multiple PSUs are used. I would suggest using 80 PLUS Titanium compliant PSUs.

For rock-solid operation I recommend using no more than ~60% of a PSU’s nominal wattage, which would suggest (making some reasonable assumptions about your overall system) that your system would require a total power supply of 4000W. If you are adventurous, you might get things to run with 3200W of power supply.

The last thing I ran was docker run --gpus device='"0,1"' etc.
This shouldn't have any effect outside of the image, only inside the container.
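For reference, the tricky part of a per-device `--gpus` invocation is the quoting: docker expects the literal string `"device=0,1"`, double quotes included, so the whole value must be single-quoted in the shell. A sketch (the image tag below is just an example, not something from this thread):

```shell
# docker needs the inner double quotes when the device list contains a
# comma, so protect them with single quotes in the shell:
GPU_ARG='"device=0,1"'
echo "$GPU_ARG"

# Typical invocation (image tag is an example):
# docker run --rm --gpus "$GPU_ARG" nvidia/cuda:10.2-base nvidia-smi
```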

I assume Tesla "V100 SXM2 32GB" is what NVIDIA calls "NVIDIA V100 FOR NVLINK"?
Correct.

2x Intel 6248 2.50GHz 20c
24x 32GB 2933MHz DDR4 RDIMM (768GB total)
4x 2200W power supplies

OK, so that system is a bit bigger than I had imagined. The sum of all components runs to about 3100W. Not sure how the PSUs are configured. I am assuming 2x2200W active with 2x2200W as hot spares? In which case the ratio of nominal system wattage to nominal PSU wattage would be 0.7, which seems OK but not ideal in my book (I am probably more conservative than most, though).
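A quick back-of-the-envelope check of that ratio, using the numbers above (~3100 W estimated draw; the 2x-active PSU configuration is my assumption):

```shell
# Headroom check: estimated draw / active PSU capacity.
draw=3100                  # W, estimated sum of all components
psu=$((2 * 2200))          # W, assuming 2 of the 4 PSUs are active
ratio=$(awk -v d="$draw" -v p="$psu" 'BEGIN { printf "%.2f", d / p }')
echo "$ratio"              # prints 0.70 -- above the ~0.60 comfort target
```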

If you bought your system fully-configured from an NVIDIA-approved system integrator, you would definitely want to seek assistance from them. Otherwise you are on your own. There are various caveats when self-assembling a huge Tesla-based system like this, and I certainly have zero practical experience with doing that.

Generic advice: Check system BIOS settings, check power supply, check cooling. This system is a veritable space heater and requires massive airflow for cooling unless you use water cooling.

So I have 4 of the same system, all homogeneous. Only two of the nodes are acting like this; the other two are fine. Same hardware configuration and software stack: all BIOS settings and versions are the same, same kernel, same OS release, and all the cards have the same VBIOS.

Since you presumably spent something like 4x $80K on these systems, your first point of contact should be the system vendor who can help you resolve these discrepancies.

Issue resolved!
I set DIP switch 6 to ON, which resets the NVRAM.