Dear NVIDIA Support and Community:
We’re having trouble with the CUDA driver on our 8x K80 setup. Even on a brand-new Ubuntu installation, with a clean CUDA installation and a clean make of the samples, the samples hang when run. Running nvidia-smi hangs the computer. Curiously, the GPU information under /proc/driver/nvidia/gpus/*/information lists unknown IDs: GPU UUID: GPU-????????-????-????-????-???????????? and Video BIOS: ??.??.??.??.??
We’ve now tried this installation five times, each time on a freshly wiped and reinstalled Ubuntu. The saving grace is that one of the installations worked fine, suggesting that the hardware is sound and that BIOS-level changes are probably not required. The other four attempts hit the problems above. Each installation was done the same way.
We’ve done a multi-GPU setup before with no issues; the differences here are the motherboard (TYAN FT77C-B7079) and 8x Tesla K80s instead of 3x Titan Xs. However, this was a hardware combination we vetted with NVIDIA before our substantial investment.
We could not find any prior case that matches exactly, but this one is closest: https://devtalk.nvidia.com/default/topic/793760/linux/driver-cannot-detect-third-gtx980-using-346-16-on-ubuntu-quot-error-5-quot-in-logs In our case, though, the hang produces no output at all.
Our complete setup details are below, but first I have some questions:
- Would BIOS-level changes be required for such a setup? I’m hesitant to make BIOS changes (a la https://devtalk.nvidia.com/default/topic/793760/linux/driver-cannot-detect-third-gtx980-using-346-16-on-ubuntu-quot-error-5-quot-in-logs) without good cause, lest we break things further.
- Is there a unique installation procedure for multi-Tesla setups?
- Is there a way to get some verbose output from nvidia-smi so we can at least get an indication of where things are hanging?
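Independent of whatever nvidia-smi itself offers, one generic way to at least bound a hang is to run the command under a time limit and record whether it was cut off. The following is only a sketch: `probe_hang` is a hypothetical helper name, it relies on the GNU coreutils `timeout` utility (which exits with status 124 when the limit is reached), and it may not return at all if the process is stuck uninterruptibly inside the driver or if the whole machine locks up.

```shell
#!/usr/bin/env bash
# probe_hang SECONDS CMD [ARGS...]
# Hypothetical helper: run CMD under a time limit and report whether it
# finished or had to be cut off. Relies on GNU coreutils `timeout`,
# which exits with status 124 when the limit is reached.
probe_hang() {
    local limit="$1" rc
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HUNG after ${limit}s: $*"
    else
        echo "exited with status $rc: $*"
    fi
    return "$rc"
}

# Example (not run here): probe_hang 60 nvidia-smi
```

If strace is installed, wrapping the command as `strace -f -o trace.log nvidia-smi` would additionally record the last system call reached before the hang, and the kernel log (`dmesg`) is worth checking for lines prefixed with `NVRM:` afterwards.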
Thanks, Saif in Brooklyn
Operating System:
Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)
Full Signature:
saif@DeepHorizon:~$ uname -a
Linux DeepHorizon 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Driver Installed:
cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
Obtained from: the CUDA Toolkit downloads page on NVIDIA Developer
The Motherboard is a TYAN FT77C-B7079:
saif@DeepHorizon:~$ sudo dmidecode -t 2
# dmidecode 2.12
SMBIOS 3.0 present.
# SMBIOS implementations newer than version 2.8 are not
# fully supported by this version of dmidecode.
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: TYAN
Product Name: FT77C-B7079
Version: empty
Serial Number: *******
Asset Tag: empty
Features:
Board is a hosting board
Board is replaceable
Location In Chassis: empty
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
Output of “cat /proc/driver/nvidia/gpus/*/information”:
saif@DeepHorizon:~$ cat /proc/driver/nvidia/gpus/*/information
Model: Tesla K80
IRQ: 26
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:06:00.0
Model: Tesla K80
IRQ: 26
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:07:00.0
Model: Tesla K80
IRQ: 26
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:0a:00.0
Model: Tesla K80
IRQ: 26
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:0b:00.0
Model: Tesla K80
IRQ: 28
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:10:00.0
Model: Tesla K80
IRQ: 28
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:11:00.0
Model: Tesla K80
IRQ: 28
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:14:00.0
Model: Tesla K80
IRQ: 28
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:15:00.0
Model: Tesla K80
IRQ: 51
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:85:00.0
Model: Tesla K80
IRQ: 51
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:86:00.0
Model: Tesla K80
IRQ: 51
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:89:00.0
Model: Tesla K80
IRQ: 51
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:8a:00.0
Model: Tesla K80
IRQ: 53
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:8f:00.0
Model: Tesla K80
IRQ: 53
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:90:00.0
Model: Tesla K80
IRQ: 53
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:93:00.0
Model: Tesla K80
IRQ: 53
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 36 bits
DMA Mask: 0xfffffffff
Bus Location: 0000:94:00.0
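For anyone comparing against their own machine, the listing above can be condensed to one line per GPU. This is a minimal sketch that assumes the `Key: value` layout shown above; `summarize_gpus` and `info_field` are hypothetical helper names, and the only thing checked is whether the UUID field contains `?` placeholders.

```shell
#!/usr/bin/env bash
# info_field KEY FILE
# Print the value of the first "KEY: value" line in FILE. Only the text up
# to the first colon is stripped, so values that contain colons themselves
# (such as PCI bus locations) are preserved.
info_field() {
    awk -v key="$1" 'index($0, key ":") == 1 {
        sub(/^[^:]*:[ \t]*/, ""); print; exit
    }' "$2"
}

# summarize_gpus [BASE_DIR]
# Print one line per GPU, then a count of GPUs whose UUID is unknown.
# BASE_DIR defaults to the driver's procfs directory.
summarize_gpus() {
    local base="${1:-/proc/driver/nvidia/gpus}"
    local total=0 unknown=0 f bus uuid
    for f in "$base"/*/information; do
        [ -r "$f" ] || continue
        total=$((total + 1))
        bus=$(info_field "Bus Location" "$f")
        uuid=$(info_field "GPU UUID" "$f")
        case "$uuid" in
            *\?*) unknown=$((unknown + 1)); echo "$bus: UUID unknown" ;;
            *)    echo "$bus: $uuid" ;;
        esac
    done
    echo "$unknown of $total GPUs report an unknown UUID"
}

# Example (not run here): summarize_gpus
```

Run against the output pasted above, every bus location would be reported with an unknown UUID.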