Tesla V100 PCIE fails after some time on Ubuntu 18.04


I have a Tesla V100 PCIE card in a workstation running Ubuntu 18.04 with driver version 410.48 and CUDA 10 installed. When I restart the machine, the V100 seems to run fine:

joishi@a23637:~$ nvidia-smi
Thu Jan 24 15:32:34 2019
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Quadro K620 Off | 00000000:03:00.0 Off | N/A |
| 35% 53C P0 2W / 30W | 0MiB / 1999MiB | 0% Default |
| 1 Tesla V100-PCIE… Off | 00000000:04:00.0 Off | 0 |
| N/A 73C P0 53W / 250W | 0MiB / 16130MiB | 0% Default |

| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| No running processes found |

I have tested it with the BlackScholes sample from the CUDA samples, and it works fine:

joishi@a23637:~/cuda_samples/NVIDIA_CUDA-10.0_Samples/bin/x86_64/linux/release$ ./BlackScholes
[./BlackScholes] - Starting…
GPU Device 0: “Tesla V100-PCIE-16GB” with compute capability 7.0

Initializing data…
…allocating CPU memory for options.
…allocating GPU memory for options.
…generating input data in CPU mem.
…copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)…
Options count : 8000000
BlackScholesGPU() time : 0.100197 msec
Effective memory bandwidth: 798.425002 GB/s
Gigaoptions per second : 79.842500

BlackScholes, Throughput = 79.8425 GOptions/s, Time = 0.00010 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results…
Checking the results…
…running CPU calculations.

Comparing the results…
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down…
…releasing GPU memory.
…releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed

However, after some time, the card simply stops working:

joishi@a23637:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:04:00.0: Unknown Error

The machine also has a Quadro K620 in it. If I reboot, the V100 is detected again. I’ve attached the output from nvidia-bug-report.sh to this message. Any advice is much appreciated!


nvidia-bug-report.log.gz (183 KB)

I think your problem is very likely heat:

N/A 73C P0 53W / 250W

Seeing 73C while the card is idle and drawing only 53W is a big warning sign.

The Tesla V100 PCIe is passively cooled: it has a heatsink but does NOT have a fan or blower of its own, as it is designed to be installed in a rack server which supplies airflow. If you have it installed in a workstation or typical PC case, you MUST provide cooling airflow to the card. The card can dissipate up to 250 watts (the power cap shown in your nvidia-smi output) and needs very significant airflow to keep from overheating.

It isn’t designed to operate in a PC/workstation without added airflow, as it lacks any sort of fan or blower.
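
If you want to confirm that heat is the cause before the card drops out, one option is to log the temperature while it is still visible. Below is a minimal Python sketch, assuming nvidia-smi is on the PATH and that your driver supports the --query-gpu fields used here. The 70C warning threshold and the function names are my own illustrative choices, not NVIDIA values.

```python
import subprocess

# Hypothetical warning threshold chosen for illustration; not an NVIDIA spec.
WARN_TEMP_C = 70

# nvidia-smi query that prints one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_number(field, cast):
    """Convert a CSV field to a number, or None if nvidia-smi reports N/A."""
    try:
        return cast(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV rows into a list of per-GPU dicts."""
    readings = []
    for line in text.strip().splitlines():
        index, name, temp, power = [f.strip() for f in line.split(",")]
        readings.append({
            "index": int(index),
            "name": name,
            "temp_c": _to_number(temp, int),
            "power_w": _to_number(power, float),
        })
    return readings


def hot_gpus(readings, limit=WARN_TEMP_C):
    """Return the readings whose temperature is at or above the limit."""
    return [r for r in readings
            if r["temp_c"] is not None and r["temp_c"] >= limit]


def poll_once():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling poll_once() every few seconds (for example in a loop with time.sleep) and printing hot_gpus(...) with a timestamp should show whether the temperature keeps climbing in the minutes before nvidia-smi starts returning the error.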