Hi,
I have a Tesla V100 PCIe card in a workstation running Ubuntu 18.04 with driver version 410.48 and CUDA 10 installed. Right after the machine boots, the V100 seems to run fine:
joishi@a23637:~$ nvidia-smi
Thu Jan 24 15:32:34 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K620         Off  | 00000000:03:00.0 Off |                  N/A |
| 35%   53C    P0     2W /  30W |      0MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   73C    P0    53W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I have tested it with the BlackScholes example from the CUDA samples, and it runs fine:
joishi@a23637:~/cuda_samples/NVIDIA_CUDA-10.0_Samples/bin/x86_64/linux/release$ ./BlackScholes
[./BlackScholes] - Starting...
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count             : 8000000
BlackScholesGPU() time    : 0.100197 msec
Effective memory bandwidth: 798.425002 GB/s
Gigaoptions per second    : 79.842500

BlackScholes, Throughput = 79.8425 GOptions/s, Time = 0.00010 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.
[BlackScholes] - Test Summary
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Test passed
However, after some time, the card simply stops working:
joishi@a23637:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:04:00.0: Unknown Error
The machine also has a Quadro K620 in it. If I reboot, the V100 is detected again. I've attached the output from nvidia-bug-report.sh to this message. Any advice is much appreciated!
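In the meantime, I'm planning to leave something like the following running so I can tell exactly when the card drops off and grab any driver messages from the kernel log around that time (just a rough sketch; the GPU index, log path, and the assumption that nvidia-smi exits non-zero once it hits that error are mine):

#!/usr/bin/env bash
# Rough sketch: poll the V100 (index 1 per nvidia-smi above) once a minute and
# stop when it can no longer be queried, then dump recent NVRM/Xid kernel
# messages. The log path is just a placeholder.
LOG=/tmp/v100_watch.log

while true; do
    date >> "$LOG"
    if ! nvidia-smi -i 1 >> "$LOG" 2>&1; then
        echo "nvidia-smi failed; capturing recent NVRM/Xid messages" >> "$LOG"
        dmesg | grep -iE 'NVRM|Xid' | tail -n 50 >> "$LOG"
        break
    fi
    sleep 60
done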
thanks,
Jeff
nvidia-bug-report.log.gz (183 KB)