Only 1 K80 device appearing in Ubuntu VM

Hey all. Hoping someone has an idea here. I’m trying to use my 3 Tesla K80s (each card shows up as two GPU devices, so 6 in total) with some VMs on a server running Unraid on a Dell R740xd. However, it’s acting very strangely: the host can see all of the devices, but nvidia-smi in the guest VM only shows 1 of them.

From an Unraid terminal on the host, I see the following:

root@server:~# nvidia-smi
Wed Apr 17 18:27:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   44C    P0    57W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   34C    P0    71W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   41C    P0    56W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   33C    P0    71W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:DA:00.0 Off |                    0 |
| N/A   43C    P0    58W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:DB:00.0 Off |                    0 |
| N/A   31C    P0    70W / 149W |      0MiB / 11441MiB |     69%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@server:~# lspci | grep 3D
3d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
3e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
b1:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
b2:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
da:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
db:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Inside my Ubuntu VM, where I’d like to use the GPUs, nvidia-smi only shows 1 device:
root@gpu1:~$ nvidia-smi
Thu Apr 18 02:14:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    64W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@gpu1:~$ lspci | grep NV
06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Any ideas? lspci shows all six devices inside the guest, so I’m not sure why nvidia-smi only reports a single one.

TIA!

I don’t know what Unraid is.

Have you checked whether the VM has a configuration parameter that controls how many GPUs are visible to the guest?

Also, check whether the environment variable CUDA_VISIBLE_DEVICES is set.
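
For example (a quick sketch to run inside the guest; adjust to your shell), something like this would show whether that variable is set and whether the NVIDIA kernel driver actually bound to all of the passed-through devices:

env | grep CUDA_VISIBLE_DEVICES    # no output means the variable is not set
lspci -k -d 10de:                  # lists NVIDIA devices and the kernel driver in use for each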

Unraid is basically a hypervisor that lets you configure and deploy VMs.

As for configuration parameters, I believe assigning all 6 GPUs to the guest on the hypervisor side covers that, and it seems to have at least partially worked given the lspci output above. However, neither the host nor the guest has CUDA_VISIBLE_DEVICES set.

I think I may have just found the problem. I noticed the following in dmesg:

[ 9.283359] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:07:00.0)
[ 9.283360] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283383] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:08:00.0)
[ 9.283384] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283405] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:09:00.0)
[ 9.283406] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283430] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 9.283431] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283452] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 9.283453] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283465] NVRM: The NVIDIA probe routine failed for 5 device(s).
[ 9.283465] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 418.43 Tue Feb 19 01:12:11 CST 2019

So I guess I need to figure out why this is happening…

I agree, that looks like a smoking gun. Especially for Tesla GPUs, BAR0 (a PCIe base address register) requires a fairly large chunk of address space. Check the BIOS for any relevant settings (for example, options related to above-4G decoding or large/64-bit MMIO). Some BIOSes may not provide sufficient flexibility to support six Tesla GPUs.
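
As a rough check inside the guest (the bus addresses below are the ones from your lspci output), something like this will print the BAR/region assignments each device actually received, which should line up with the dmesg complaints:

for d in 06 07 08 09 0a 0b; do
  echo "=== ${d}:00.0 ==="
  lspci -vv -s ${d}:00.0 | grep -i region
done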

Good news! I got it working, but I’m not sure why this works. You mentioned memory and BIOS. I had given this VM 16 GB of memory… not sure if that was a limiting factor. I created a new VM; in Unraid I can choose between two BIOS options, OVMF and SeaBIOS, and the first VM was using OVMF.

So I recreated the VM with SeaBIOS, gave it 180 GB of memory, and now all of the GPUs are available!

I tried just giving more memory to the OVMF version, but that didn’t work. SeaBIOS it is.
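
For anyone who finds this later: my understanding (not something I’ve verified on this box) is that the “BAR0 is 0M @ 0x0” messages mean the guest firmware couldn’t fit the cards’ large 64-bit BARs into its PCI MMIO window, so it’s an address-space problem rather than a RAM problem, which would explain why simply adding memory to the OVMF VM didn’t help. If you’d rather stay on OVMF, a workaround I’ve seen suggested (untested here, and 65536 MB is just an example value) is to enlarge OVMF’s 64-bit MMIO aperture by passing an extra firmware config option through to QEMU, e.g. via a qemu:commandline block in the VM’s XML:

-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536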