I'm running Ubuntu 18.04 with 8x Tesla V100 SXM2 32GB. I had all 8 GPUs up and running and could see all 8 in nvidia-smi and lspci. As of today I only see two GPUs, in slots 5 and 7 of the system. I've tried all the troubleshooting steps I can think of and believe the problem to be software in nature. This is happening on two of my 6 nodes, which are all homogeneous.
root@znode48:~# uname -a
Linux znode48 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
root@znode48:~# lspci | grep -i nvidia
15:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
16:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
3b:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
8a:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b2:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
b3:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
root@znode48:~# nvidia-smi
Mon Mar 2 16:40:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@znode48:~# dmesg | grep -i nvidia
[ 5.840912] nvidia: module license 'NVIDIA' taints kernel.
[ 6.060696] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[ 6.098954] nvidia 0000:15:00.0: enabling device (0140 -> 0142)
[ 6.123568] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.174247] nvidia: probe of 0000:15:00.0 failed with error -1
[ 6.194921] nvidia 0000:16:00.0: enabling device (0140 -> 0142)
[ 6.205164] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.235924] nvidia: probe of 0000:16:00.0 failed with error -1
[ 6.246095] nvidia 0000:3a:00.0: enabling device (0140 -> 0142)
[ 6.317316] nvidia 0000:3b:00.0: enabling device (0140 -> 0142)
[ 6.369590] nvidia 0000:89:00.0: enabling device (0140 -> 0142)
[ 6.369662] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369675] nvidia: probe of 0000:89:00.0 failed with error -1
[ 6.369692] nvidia 0000:8a:00.0: enabling device (0140 -> 0142)
[ 6.369731] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369739] nvidia: probe of 0000:8a:00.0 failed with error -1
[ 6.369760] nvidia 0000:b2:00.0: enabling device (0140 -> 0142)
[ 6.369793] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369800] nvidia: probe of 0000:b2:00.0 failed with error -1
[ 6.369812] nvidia 0000:b3:00.0: enabling device (0140 -> 0142)
[ 6.369843] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 6.369849] nvidia: probe of 0000:b3:00.0 failed with error -1
[ 6.369870] NVRM: The NVIDIA probe routine failed for 6 device(s).
[ 6.369871] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.33.01 Wed Nov 13 00:00:22 UTC 2019
[ 6.377600] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.33.01 Tue Nov 12 23:43:11 UTC 2019
[ 6.516842] [drm] [nvidia-drm] [GPU ID 0x00003a00] Loading driver
[ 6.516861] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3a:00.0 on minor 1
[ 6.516964] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[ 6.516981] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3b:00.0 on minor 2
[ 10.676097] nvidia-uvm: Loaded the UVM driver, major device number 510.
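For reference, the repeated "NVRM: This PCI I/O region assigned to your NVIDIA device is invalid" lines indicate the kernel did not hand valid PCI resources (BARs) to the six GPUs whose probe failed. A minimal way to compare a failed device against a working one, using the bus addresses from the lspci output above (a sketch only, not a definitive diagnosis):

# BAR/resource assignments of a GPU whose probe failed vs. one that came up
lspci -vv -s 15:00.0 | grep -i region     # probe failed per dmesg
lspci -vv -s 3a:00.0 | grep -i region     # probe succeeded
# kernel messages about BAR / resource assignment problems
dmesg | grep -iE "BAR|failed to assign|no space for"

If the failed devices show unassigned or zero-sized regions, checking the BIOS for an "Above 4G Decoding" / large-BAR option and trying the pci=realloc kernel boot parameter are things worth looking at, though I can't say for certain they apply here.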
I assume Tesla "V100 SXM2 32GB" is what NVIDIA calls "NVIDIA V100 FOR NVLINK", each of which draws 300W? If you suspect general flakiness, the first thing I would check is the power supply. What does the rest of the platform look like, in terms of CPUs and system memory? What does the power supply solution look like? Presumably multiple PSUs are used; I would suggest using 80 PLUS Titanium compliant PSUs.
For rock-solid operation I recommend using no more than ~60% of a PSU's nominal wattage, which would suggest (making some reasonable assumptions about your overall system) that your system requires a total power supply capacity of 4000W. If you are adventurous, you might get things to run with 3200W of power supply.
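Back-of-the-envelope, assuming roughly 2400W of sustained draw (8 GPUs x 300W, before CPUs, memory, and fans), the numbers above fall out like this (a sketch with assumed figures, not measurements):

draw_w=2400                        # assumed sustained draw in watts
echo $(( draw_w * 100 / 60 ))      # 4000 W of nominal PSU capacity at ~60% loading
echo $(( draw_w * 100 / 75 ))      # 3200 W at a more aggressive ~75% loading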
OK, so that system is a bit bigger than I had imagined. The sum of all components runs to about 3100W. I am not sure how the PSUs are configured; I am assuming 2x 2200W active with 2x 2200W as hot spares? In that case the ratio of nominal system wattage to nominal PSU wattage would be about 0.7 (3100W / 4400W), which seems OK but not ideal in my book (I am probably more conservative than most, though).
If you bought your system fully-configured from an NVIDIA-approved system integrator, you would definitely want to seek assistance from them. Otherwise you are on your own. There are various caveats when self-assembling a huge Tesla-based system like this, and I certainly have zero practical experience with doing that.
Generic advice: Check system BIOS settings, check power supply, check cooling. This system is a veritable space heater and requires massive airflow for cooling unless you use water cooling.
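If it helps, power draw and thermals on the GPUs that do come up can be sanity-checked with standard queries like the following (ipmitool is an assumption on my part and may not be installed on your nodes):

nvidia-smi -q -d POWER,TEMPERATURE                                     # per-GPU power limits, draw, and temperatures
nvidia-smi --query-gpu=index,power.draw,temperature.gpu --format=csv   # compact per-GPU summary
ipmitool sensor | grep -iE "fan|temp|watt"                             # BMC view of fans/temps/power, if available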
So I have 4 of the same system, all homogeneous. Only two of the nodes are acting like this; the other two are fine. Same hardware configuration and software stack: the BIOS settings and BIOS version are the same, the kernel and OS release are the same, and all the cards run the same VBIOS.
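For reference, the node-to-node comparison can be done with standard queries along these lines (a sketch; exact fields may vary by driver version):

uname -r                                                        # kernel version
cat /proc/driver/nvidia/version                                 # loaded NVIDIA driver build
dmidecode -s bios-version                                       # system BIOS version
nvidia-smi --query-gpu=pci.bus_id,vbios_version --format=csv    # VBIOS of each visible GPU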
Since you presumably spent something like 4x $80K on these systems, your first point of contact should be the system vendor who can help you resolve these discrepancies.
I have the same problem on my workstation running 4x A100: in lspci I see all 4, but in nvidia-smi only 2 show up. Can you please help me with this?