Only 1 K80 device appearing in Ubuntu VM

Hey all. Hoping someone has an idea here. I’m trying to use my 3 Tesla K80s (each card shows up as two GPU devices, so 6 in total) with some VMs on a server running Unraid on a Dell R740xd. However, it’s acting very strangely: the host can see all of the devices, but nvidia-smi in the guest VM only shows 1 of them.

From an Unraid terminal on the host, I see the following:

root@server:~# nvidia-smi
Wed Apr 17 18:27:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   44C    P0    57W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   34C    P0    71W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   41C    P0    56W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   33C    P0    71W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:DA:00.0 Off |                    0 |
| N/A   43C    P0    58W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:DB:00.0 Off |                    0 |
| N/A   31C    P0    70W / 149W |      0MiB / 11441MiB |     69%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@server:~# lspci | grep 3D
3d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
3e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
b1:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
b2:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
da:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
db:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Inside my Ubuntu VM, where I’d like to use the GPUs, nvidia-smi only shows 1 device:
root@gpu1:~$ nvidia-smi
Thu Apr 18 02:14:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    64W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@gpu1:~$ lspci | grep NV
06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Any ideas? lspci shows all six devices inside the guest, so I’m not sure why nvidia-smi only reports a single one.

TIA!

I don’t know what Unraid is.

Have you checked whether the VM has a configuration parameter that controls how many GPUs are visible to the guest?

Also, check whether the environment variable CUDA_VISIBLE_DEVICES is set.
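
For example (a quick sketch to run inside the guest; adjust to your shell), something like this would show whether that variable is set and whether the NVIDIA kernel driver actually bound to all of the passed-through devices:

env | grep CUDA_VISIBLE_DEVICES    # no output means the variable is not set
lspci -k -d 10de:                  # lists NVIDIA devices and the kernel driver in use for each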

Unraid is basically a hypervisor that lets you configure and deploy VMs.

As for configuration parameters, I believe assigning all 6 GPUs to the guest on the hypervisor side covers that, and it seems to have at least partially worked given the lspci output above. However, neither the host nor the guest has CUDA_VISIBLE_DEVICES set.

I think I may have just found the problem. I noticed the following in dmesg:

[ 9.283359] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:07:00.0)
[ 9.283360] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283383] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:08:00.0)
[ 9.283384] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283405] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:09:00.0)
[ 9.283406] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283430] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 9.283431] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283452] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 9.283453] NVRM: The system BIOS may have misconfigured your GPU.
[ 9.283465] NVRM: The NVIDIA probe routine failed for 5 device(s).
[ 9.283465] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 418.43 Tue Feb 19 01:12:11 CST 2019

So I guess I need to figure out why this is happening…

I agree, that looks like a smoking gun. Especially for Tesla GPUs, BAR0 (a PCIe base address register) requires a fairly large chunk of address space. Check the BIOS for any relevant settings (for example, options related to above-4G decoding or large/64-bit MMIO). Some BIOSes may not provide sufficient flexibility to support six Tesla GPUs.
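
As a rough check inside the guest (the bus addresses below are the ones from your lspci output), something like this will print the BAR/region assignments each device actually received, which should line up with the dmesg complaints:

for d in 06 07 08 09 0a 0b; do
  echo "=== ${d}:00.0 ==="
  lspci -vv -s ${d}:00.0 | grep -i region
done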

Good news! I got it working, but I’m not sure why this works. You mentioned memory and BIOS. I had given this VM 16 GB of memory… not sure if that was a limiting factor. I created a new VM; in Unraid I can choose between two BIOS options, OVMF and SeaBIOS, and the first VM was using OVMF.

So I recreated the VM with SeaBIOS, gave it 180 GB of memory, and now all of the GPUs are available!

I tried just giving more memory to the OVMF version, but that didn’t work. SeaBIOS it is.
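
For anyone who finds this later: my understanding (not something I’ve verified on this box) is that the “BAR0 is 0M @ 0x0” messages mean the guest firmware couldn’t fit the cards’ large 64-bit BARs into its PCI MMIO window, so it’s an address-space problem rather than a RAM problem, which would explain why simply adding memory to the OVMF VM didn’t help. If you’d rather stay on OVMF, a workaround I’ve seen suggested (untested here, and 65536 MB is just an example value) is to enlarge OVMF’s 64-bit MMIO aperture by passing an extra firmware config option through to QEMU, e.g. via a qemu:commandline block in the VM’s XML:

-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536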