Can drivers suddenly break?
Is my card dead? How can I diagnose whether a hardware issue is present?
I was running a series of TensorFlow jobs on a Linux machine with an NVIDIA GeForce RTX 2070 SUPER. The jobs had been running for 4-5 days when subsequent jobs could no longer find the GPU and TensorFlow fell back to the CPU for model fitting. I suspect the card overheated and died, but I can’t confirm this. I first noticed the problem because jobs were taking longer to complete. Once I saw that the GPU was no longer being used, I ran nvidia-smi and saw ERR in the fan % field and the power field. Before rebooting, nvidia-smi still showed a status, but the values were not changing and ERR was in the fan fields. After rebooting, nvidia-smi returned “No devices were found”.
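For reference, the state described above (fan and power fields that cannot be read) can also be checked from a script rather than by eye. This is only a minimal sketch: the helper names `find_unreadable_fields` and `query_gpus` are made up for illustration, and it assumes nvidia-smi is on the PATH. The `--query-gpu` field names used here (`name`, `fan.speed`, `power.draw`, `temperature.gpu`) are standard nvidia-smi query fields. The exact text nvidia-smi prints for a failed field in CSV mode is not taken from this post, so the sketch simply flags any value that does not start with a number.

```python
import subprocess

# Fields requested from nvidia-smi via --query-gpu.
FIELDS = ["name", "fan.speed", "power.draw", "temperature.gpu"]


def find_unreadable_fields(csv_line, fields=FIELDS):
    """Return the names of fields whose value does not start with a number.

    The 'name' field is skipped because it is free text.
    """
    values = [v.strip() for v in csv_line.split(",")]
    bad = []
    for field, value in zip(fields, values):
        if field == "name":
            continue
        number = value.split(" ")[0]  # drop a trailing unit such as "%" or "W"
        try:
            float(number)
        except ValueError:
            bad.append(field)
    return bad


def query_gpus():
    """Run nvidia-smi and return one CSV line per GPU (empty list on failure)."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    lines = query_gpus()
    if not lines:
        print("nvidia-smi reported no usable devices")
    for line in lines:
        print(line, "-> unreadable:", find_unreadable_fields(line))
```

Run periodically between jobs, a check like this would flag the problem before several jobs have silently fallen back to the CPU.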
The TensorFlow jobs were many separate model fit/evaluate runs. Each job was put into its own shell script, and the scripts were then run with
ls *sh | xargs -n 1 -P 2
which ran two TensorFlow jobs concurrently on the GPU.
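The same two-at-a-time submission can be sketched in Python, with the added benefit that each script's exit code is recorded. This is an illustrative sketch, not the setup actually used: `run_script` and `run_all` are hypothetical helpers, and it assumes each script is run with `sh`.

```python
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_script(path, interpreter="sh"):
    """Run one job script and return (path, exit code)."""
    completed = subprocess.run([interpreter, path])
    return path, completed.returncode


def run_all(paths, max_parallel=2, interpreter="sh"):
    """Run scripts with at most max_parallel active at once.

    Results are returned in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda p: run_script(p, interpreter), paths))


if __name__ == "__main__":
    # Same file pattern as the xargs pipeline; two jobs share the GPU at a time.
    for path, code in run_all(sorted(glob.glob("*sh"))):
        print(path, "exit code", code)
```

Threads are sufficient here because each worker only waits on a child process. The actual model fitting happens in the spawned scripts.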
This screenshot shows a dmesg call
Here’s the TensorFlow log output at startup when the GPU was working:
2021-08-13 16:59:27.430099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
Here’s the log output at TensorFlow startup when the GPU is not working:
2021-08-19 08:40:10.775228: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-19 08:40:14.690222: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-19 08:40:14.711889: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-08-19 08:40:14.711935: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: t3-rd
2021-08-19 08:40:14.711946: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: t3-rd
2021-08-19 08:40:14.712048: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 465.19.1
2021-08-19 08:40:14.712078: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 465.19.1
2021-08-19 08:40:14.712088: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 465.19.1
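The difference between the two startup logs above can be detected automatically, so that a job stops instead of quietly running on the CPU. The following is a minimal sketch that only looks for the two messages quoted in this post ("Found device N with properties" and "failed call to cuInit"). The function name `classify_startup_log` is hypothetical, and other TensorFlow versions may word these messages differently.

```python
import re

# Patterns taken from the two log excerpts quoted above.
FOUND_GPU = re.compile(r"Found device \d+ with properties")
CUINIT_FAILED = re.compile(r"failed call to cuInit: (CUDA_ERROR_\w+)")


def classify_startup_log(text):
    """Classify a TensorFlow startup log.

    Returns ("cuda_error", code) if cuInit failed,
    ("gpu", None) if a GPU device was found,
    and ("unknown", None) otherwise.
    """
    match = CUINIT_FAILED.search(text)
    if match:
        return "cuda_error", match.group(1)
    if FOUND_GPU.search(text):
        return "gpu", None
    return "unknown", None


if __name__ == "__main__":
    import sys

    # Usage: python check_log.py < job.log
    status, code = classify_startup_log(sys.stdin.read())
    print(status, code or "")
    sys.exit(0 if status == "gpu" else 1)
```

A job script could pipe its log through this check after startup and exit with a non-zero status when the result is not "gpu".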