Tesla V100 PCIE fails after some time on Ubuntu 18.04

joishi · January 29, 2019, 7:57pm

Hi,

I have a Tesla V100 PCIE card in a workstation running Ubuntu 18.04 with driver version 410.48 and CUDA 10 installed. When I restart the machine, the V100 seems to run fine:

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

I have tested it with the Black Scholes test from the CUDA samples, and it works fine:

joishi@a23637:~/cuda_samples/NVIDIA_CUDA-10.0_Samples/bin/x86_64/linux/release$ ./BlackScholes
[./BlackScholes] - Starting...
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.100197 msec
Effective memory bandwidth: 798.425002 GB/s
Gigaoptions per second : 79.842500

BlackScholes, Throughput = 79.8425 GOptions/s, Time = 0.00010 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed

However, after some time, the card simply stops working:

joishi@a23637:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:04:00.0: Unknown Error

The machine also has a Quadro K620 in it. If I reboot, the V100 is detected again. I’ve attached the output from nvidia-bug-report.sh to this message. Any advice is much appreciated!

thanks,

Jeff
nvidia-bug-report.log.gz (183 KB)

Larry-SB · January 29, 2019, 10:14pm

I think your problem is very likely heat:

N/A 73C P0 53W / 250W

Seeing 73C while the card is idling at 53W is a big warning sign.

The Tesla V100 PCIe does NOT have a fan or any other active cooling. It is a passively cooled card, designed to be installed in a rack server that supplies the airflow. If you have it installed in a workstation or typical PC case, you MUST provide cooling airflow to the card. It can dissipate up to 250 watts (the power limit shown in your nvidia-smi output) and needs very significant airflow to keep from overheating.

It isn’t designed to operate in a PC/workstation, as it lacks any sort of fan or blower.
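
If you want to confirm that heat is the cause before changing the cooling, one option is to log the temperature until the card drops out. Below is a minimal sketch in Python, assuming nvidia-smi is on the PATH. The `--query-gpu` fields are standard nvidia-smi options, but the helper names, the 70C warning threshold, and the 5 second polling interval are my own placeholders, not NVIDIA recommendations.

```python
import subprocess
import time

# Standard nvidia-smi query options (see `nvidia-smi --help-query-gpu`).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]

WARN_TEMP_C = 70.0  # hypothetical warning threshold, not an NVIDIA spec
POLL_SECONDS = 5    # hypothetical polling interval


def to_number(field):
    """Return the field as a float, or None for values such as '[N/A]'."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_reading(line):
    """Parse one CSV line such as '0, 73, 53.21' into a dict."""
    index, temp, power = (part.strip() for part in line.split(","))
    return {
        "index": int(index),
        "temp_c": to_number(temp),
        "power_w": to_number(power),
    }


def is_too_hot(reading, limit=WARN_TEMP_C):
    """True when a temperature was reported and it is at or above the limit."""
    temp = reading["temp_c"]
    return temp is not None and temp >= limit


def poll_forever():
    """Print one timestamped line per GPU every POLL_SECONDS until interrupted."""
    while True:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        stamp = time.strftime("%H:%M:%S")
        if result.returncode != 0:
            # Covers failures like "Unable to determine the device handle".
            print(stamp, "nvidia-smi failed:", (result.stdout + result.stderr).strip())
        else:
            for line in result.stdout.splitlines():
                if not line.strip():
                    continue
                try:
                    reading = parse_reading(line)
                except ValueError:
                    # Unexpected line format: show it raw instead of crashing.
                    print(stamp, "unparsed:", line)
                    continue
                flag = "  <-- HOT" if is_too_hot(reading) else ""
                print(stamp, reading, flag)
        time.sleep(POLL_SECONDS)
```

Call `poll_forever()` in a terminal (or redirect the output to a file) right after a reboot and leave it running. If the last readings before nvidia-smi starts failing show the temperature climbing steadily while power draw stays low, that points to insufficient airflow rather than a driver problem.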
