Total Hang on nvidia-smi and CUDA Sample Runs on Multi-K80 System

Dear NVIDIA Support and Community:

We’re having trouble with our 8x K80 setup after installing the CUDA driver. Even on a brand-new Ubuntu installation, with a clean CUDA installation and a clean make of the samples, the samples hang when run, and running nvidia-smi hangs the computer. Curiously, the GPU information under /proc/driver/nvidia/gpus/*/information lists unknown IDs: GPU UUID: GPU-???-???-???-???-??? and Video BIOS: ??.??.??.??.??

We’ve now tried this installation five times, each time with a fresh Ubuntu wipe and install. The saving grace is that one of the installations worked fine, suggesting that the hardware is fine and that probably no BIOS-level changes are required. The other four attempts hit the problems described above. Each installation was done the same way.

We’ve done a multi-GPU setup before with no issues; the differences here are the motherboard (TYAN FT77C-B7079) and 8x Tesla K80s instead of 3x Titan Xs. However, this was a hardware combination we vetted with NVIDIA before our substantial investment.

We could not find any prior cases that match exactly, but this one is the closest: https://devtalk.nvidia.com/default/topic/793760/linux/driver-cannot-detect-third-gtx980-using-346-16-on-ubuntu-quot-error-5-quot-in-logs In our case, though, the hang produces no output at all.

Our complete setup details are below, but I had some questions:

Thanks, Saif in Brooklyn

Operating System:
Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)

Full Signature:
saif@DeepHorizon:~$ uname -a
Linux DeepHorizon 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Driver Installed:
cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
Obtained from: the CUDA Toolkit downloads page on NVIDIA Developer

The Motherboard is a TYAN FT77C-B7079:

saif@DeepHorizon:~$ sudo dmidecode -t 2
# dmidecode 2.12
SMBIOS 3.0 present.
# SMBIOS implementations newer than version 2.8 are not
# fully supported by this version of dmidecode.
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
	Manufacturer: TYAN
	Product Name: FT77C-B7079
	Version: empty
	Serial Number: *******
	Asset Tag: empty
	Features:
		Board is a hosting board
		Board is replaceable
	Location In Chassis: empty
	Chassis Handle: 0x0003
	Type: Motherboard
	Contained Object Handles: 0

Output of “cat /proc/driver/nvidia/gpus/*/information”:

saif@DeepHorizon:~$ cat /proc/driver/nvidia/gpus/*/information
Model: 		 Tesla K80
IRQ:   		 26
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:06:00.0
Model: 		 Tesla K80
IRQ:   		 26
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:07:00.0
Model: 		 Tesla K80
IRQ:   		 26
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:0a:00.0
Model: 		 Tesla K80
IRQ:   		 26
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:0b:00.0
Model: 		 Tesla K80
IRQ:   		 28
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:10:00.0
Model: 		 Tesla K80
IRQ:   		 28
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:11:00.0
Model: 		 Tesla K80
IRQ:   		 28
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:14:00.0
Model: 		 Tesla K80
IRQ:   		 28
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:15:00.0
Model: 		 Tesla K80
IRQ:   		 51
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:85:00.0
Model: 		 Tesla K80
IRQ:   		 51
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:86:00.0
Model: 		 Tesla K80
IRQ:   		 51
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:89:00.0
Model: 		 Tesla K80
IRQ:   		 51
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:8a:00.0
Model: 		 Tesla K80
IRQ:   		 53
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:8f:00.0
Model: 		 Tesla K80
IRQ:   		 53
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:90:00.0
Model: 		 Tesla K80
IRQ:   		 53
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:93:00.0
Model: 		 Tesla K80
IRQ:   		 53
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 36 bits
DMA Mask: 	 0xfffffffff
Bus Location: 	 0000:94:00.0
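
For anyone wanting to check the same thing on their own machine, here is a minimal sketch that parses the text above and lists the devices whose UUID or Video BIOS came back as question marks. It assumes only the "key: value" layout shown in the output; the function names `parse_gpu_records` and `unresolved_devices` are made up for illustration and are not part of any NVIDIA tool.

```python
import glob


def parse_gpu_records(text):
    """Split concatenated 'information' files into one dict per GPU device."""
    records, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Model":  # every device record starts with a Model line
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def unresolved_devices(records):
    """Return bus locations whose UUID or Video BIOS contains '?' placeholders."""
    return [
        rec.get("Bus Location", "unknown")
        for rec in records
        if "?" in rec.get("GPU UUID", "") or "?" in rec.get("Video BIOS", "")
    ]


if __name__ == "__main__":
    chunks = []
    for path in sorted(glob.glob("/proc/driver/nvidia/gpus/*/information")):
        with open(path) as handle:
            chunks.append(handle.read())
    records = parse_gpu_records("\n".join(chunks))
    bad = unresolved_devices(records)
    print(f"{len(records)} devices found, {len(bad)} with unresolved identity")
    for bus in bad:
        print("  unresolved:", bus)
```

It only reads the same /proc files as the `cat` command above and does not call nvidia-smi, so it should not trigger the hang by itself.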

I find that hard to believe. Yes, I work for NVIDIA.

First of all, K80s are designed to be sold ONLY in OEM-qualified servers. The K80 is not a GPU that is designed for “build your own” systems. Nobody at NVIDIA is supposed to “vet” any such configuration. If you search these forums, you’ll find plenty of similar statements in response to people who have acquired their own K80s and want to build their own systems.

I don’t think you’ll find any published statements from NVIDIA anywhere that suggest building your own system around K80 is a good idea.

Second, as stated elsewhere, K80s have significant power and cooling requirements, as well as system configuration requirements that would normally be handled by the system BIOS. It’s not clear if you’ve accounted for any of this (and it is basically impossible for an end user to account for proper cooling: it requires programmatic interaction with the firmware on the server BMC, as well as a proper motherboard design, not to mention airflow, ducting, and system fan design). With respect to the system BIOS, the more GPUs you place in a system, the harder it is for the system BIOS to do its job and assign proper system resources to all GPUs.

Third, as you go beyond 8 GPU devices (a single K80 presents 2 devices to the system), you are pretty much in uncharted territory. It’s quite difficult to find even OEM-qualified systems in this category. It’s not clear when you say “8 K80s” whether you mean 8 K80s (i.e. 16 devices) or 4 K80s (i.e. 8 devices), but your device output indicates the former, i.e. 16 devices. Trying to configure 8 K80s (16 devices) would probably not work from a system BIOS perspective with most available motherboards and current production system BIOSes.
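
To make the device arithmetic concrete, here is a small sketch that turns a list of "Model" entries (one per device, as in the /proc output in the original post) into a board count, using the two-devices-per-K80 figure stated above. The helper name `count_k80_boards` is hypothetical.

```python
DEVICES_PER_K80 = 2  # a single K80 board presents 2 GPU devices to the system


def count_k80_boards(models):
    """Return (k80_devices, full_boards, leftover_devices) for a list of model names."""
    k80_devices = sum(1 for model in models if model == "Tesla K80")
    full_boards, leftover = divmod(k80_devices, DEVICES_PER_K80)
    return k80_devices, full_boards, leftover


if __name__ == "__main__":
    # The posted output contains 16 "Model: Tesla K80" entries.
    models = ["Tesla K80"] * 16
    devices, boards, leftover = count_k80_boards(models)
    print(f"{devices} devices -> {boards} boards (leftover: {leftover})")
    # prints "16 devices -> 8 boards (leftover: 0)"
```

A non-zero leftover would mean one half of a board was not enumerated at all, which is a different symptom from the one reported here.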

Finally, you appear to be running a kernel that is not even validated for use with CUDA:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements
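
A quick way to compare the running kernel against the system-requirements table is sketched below. The value in `ASSUMED_VALIDATED_KERNEL` is an assumption (3.13, which is my reading of the CUDA 7.5 entry for Ubuntu 14.04) and should be replaced with whatever the linked table lists for your CUDA release; the function names are illustrative only.

```python
import platform
import re

# ASSUMPTION: kernel series taken from the install guide table for your CUDA
# version and distribution. 3.13 is used here only as an example value.
ASSUMED_VALIDATED_KERNEL = (3, 13)


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '4.2.0-27-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def matches_validated(release, validated=ASSUMED_VALIDATED_KERNEL):
    """True if the running kernel series equals the assumed validated series."""
    return kernel_major_minor(release) == validated


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    status = "matches" if matches_validated(release) else "does NOT match"
    wanted = ".".join(str(part) for part in ASSUMED_VALIDATED_KERNEL)
    print(f"kernel {release} {status} the assumed validated series {wanted}")
```

For the kernel in the original post (4.2.0-27-generic) the check reports a mismatch against the assumed 3.13 series.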

Sorry if this seems very unhelpful – I’m sure it seems that way. You’re welcome to try whatever you wish and ask for whatever support you wish from the community. But NVIDIA does not bless or “vet” such configurations, and K80 is not a GPU that is designed to be end-user assembled into such a system.