Hi, I installed a 2080 Ti and ran several DL jobs on it. After 20 minutes or so it always freezes the system, and nvidia-smi shows ERR! in both the Fan and Pwr:Usage fields. I have driver version 415.18 and am running CUDA 9.2. Any idea what’s going on?
My training first shows:
RuntimeError: cuda runtime error (73) : an illegal instruction was encountered
Then nvidia-smi becomes:
Sat Dec 8 17:19:57 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18       Driver Version: 415.18       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:09:00.0  On |                  N/A |
| 23%   34C    P8    10W / 250W |    142MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108…    Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   38C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108…    Off  | 00000000:42:00.0 Off |                  N/A |
| 23%   37C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208…    Off  | 00000000:43:00.0 Off |                  N/A |
|ERR!   52C    P2   ERR! / 300W |      1MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1526      G   /usr/lib/xorg/Xorg                            57MiB |
|    0      1571      G   /usr/bin/gnome-shell                          82MiB |
+-----------------------------------------------------------------------------+
nvidia-smi just displays the CUDA version of the driver, i.e. the maximum CUDA version the driver supports, not the installed CUDA toolkit version.
Since it also fails with CUDA 10, maybe check the card for a hardware failure using cuda-memtest.
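To make the version point concrete, here is a rough Python sketch that pulls the two numbers apart: the "CUDA Version" in the nvidia-smi banner versus the release reported by `nvcc --version`. The function names are made up for illustration, and the nvcc line is an assumed sample of what a CUDA 9.2 toolkit prints, not output from the poster's machine.

```python
import re


def driver_cuda_version(smi_banner):
    """Return the 'CUDA Version' shown in the nvidia-smi banner, or None."""
    match = re.search(r"CUDA Version:\s*([0-9]+\.[0-9]+)", smi_banner)
    return match.group(1) if match else None


def toolkit_cuda_version(nvcc_output):
    """Return the toolkit release printed by `nvcc --version`, or None."""
    match = re.search(r"release\s+([0-9]+\.[0-9]+)", nvcc_output)
    return match.group(1) if match else None


# Banner line taken from the nvidia-smi output posted above.
smi_line = "| NVIDIA-SMI 415.18 Driver Version: 415.18 CUDA Version: 10.0 |"
# Assumed sample of an nvcc version line for a CUDA 9.2 install.
nvcc_text = "Cuda compilation tools, release 9.2, V9.2.148"

print(driver_cuda_version(smi_line), toolkit_cuda_version(nvcc_text))
# prints: 10.0 9.2
```

If the two numbers differ like this, that alone is not an error: the driver only needs to support a CUDA version at least as new as the toolkit.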
I also have this problem. My machine is a Supermicro with 8 GTX 1080 Ti GPUs running driver 410.78.
Please advise what I should do to fix this. The GPUs report ERR! for both fan and power. It happens only sometimes, and the only fix at the moment is to reboot the machine, but we have a lot of ongoing jobs on the server, which makes us reluctant to reboot it. The problem causes extreme latency and lag for all currently running processes.
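Since rebooting is costly, it may help to at least catch the ERR! state early. Below is a minimal Python sketch that scans the CSV text from `nvidia-smi --query-gpu=index,fan.speed,power.draw,temperature.gpu --format=csv,noheader,nounits` and reports any GPU whose fields are not numeric. The helper names and the sample text are hypothetical, and the `[Unknown Error]` placeholder is only my assumption of what the CSV shows when the table view prints ERR!; the check treats any non-numeric value as a fault for that reason.

```python
import csv
import io

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "fan.speed", "power.draw", "temperature.gpu"]


def _is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_faulty_gpus(csv_text):
    """Map GPU index -> list of queried fields whose value is not numeric."""
    faulty = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        bad = [
            name
            for name, value in zip(QUERY_FIELDS[1:], row[1:])
            if not _is_number(value.strip())
        ]
        if bad:
            faulty[row[0].strip()] = bad
    return faulty


# Made-up sample loosely mirroring GPUs 0 and 3 from the table earlier in
# the thread; the bracketed placeholder is an assumption.
sample = """0, 23, 10.00, 34
3, [Unknown Error], [Unknown Error], 52
"""
print(find_faulty_gpus(sample))
# prints: {'3': ['fan.speed', 'power.draw']}
```

On a real machine the CSV text could come from `subprocess.run([...], capture_output=True, text=True).stdout`, run from cron or a small loop so the fault gets logged before jobs start lagging.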
In my case the problem is not due to the card; it is due to the card’s location. The one in the middle always shows this error. Also, the temperature seems to be fine.
When you were getting this error, was the temperature higher than 75C?
The same thing happens on my Titan V, but only after ~6 hours of training at ~85C. I’ll try underwatting it to 150W and holding it at ~77C. I’m using driver version 415.27 and CUDA 10, along with a 2080 Ti and Titan RTX in the same system.
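For anyone wanting to try the same power cap, here is a small Python sketch that builds the `nvidia-smi -i <index> -pl <watts>` command. The function names, the dry-run default, and GPU index 0 are my own assumptions for illustration; setting a power limit normally needs root, and the driver rejects values outside the card's allowed range.

```python
import subprocess


def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi argument list that caps one GPU's power limit."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_index, watts, dry_run=True):
    """Return the command; only execute it when dry_run is False."""
    cmd = power_limit_command(gpu_index, watts)
    if not dry_run:
        # Raises CalledProcessError if nvidia-smi refuses the value.
        subprocess.run(cmd, check=True)
    return cmd


# GPU index 0 is an assumed placeholder; 150 W is the cap mentioned above.
print(" ".join(apply_power_limit(0, 150)))
# prints: nvidia-smi -i 0 -pl 150
```

The limit does not survive a reboot, so it would have to be reapplied at startup.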
I’m wondering if anyone was able to solve this.
We’re getting similar issues on an 8-GPU Supermicro system, even when only using 3 GPUs. The issue seems to occur whenever we have two cards next to each other in the slots.
The same machine is super stable with GTX Titan X. It was crashing every few hours when we tried swapping to Titan Xp, and now we get similar behavior with RTX 2080 Ti. We didn’t do thorough troubleshooting with the Xp, but with the 2080 Ti we are now trying to narrow down the issues.
It does not seem to be the cards per se, as the same card can perform well in one slot and misbehave in another. Typical misbehavior is: for one or two of the 3 GPUs, GPU Power Draw ramps up to near 200W, but within a few seconds drops back down to 100W, with “SW Thermal Slowdown” reported by nvidia-smi, fan speed getting to ~70% and sometimes giving Fan ERR!, and GPU Temp getting over 90C. Meanwhile, one of the 3 cards (sometimes 2) behaves perfectly fine: fan speed stays around 40%, temp around 70C, power draw over 200W.
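The pattern described above (power ramps up, temperature passes 90C, power then falls to roughly half) can be checked for automatically per slot. This is only a sketch under my own assumptions: the function name, the 90C limit, the 0.6 drop ratio, and the sample readings are illustrative values chosen to match the numbers in the post, not anything nvidia-smi defines.

```python
def looks_throttled(samples, temp_limit_c=90.0, drop_ratio=0.6):
    """Flag a GPU whose power fell well below its earlier peak while hot.

    samples: list of (temp_c, power_w) tuples in time order.
    Returns True if some sample is at or above temp_limit_c while its
    power is at most drop_ratio times the highest power seen so far.
    """
    peak_power = 0.0
    for temp_c, power_w in samples:
        peak_power = max(peak_power, power_w)
        if temp_c >= temp_limit_c and power_w <= drop_ratio * peak_power:
            return True
    return False


# Illustrative readings only: one card that ramps to ~200W, overheats and
# drops to ~100W, and one that holds over 200W at about 70C.
misbehaving = [(70, 120), (85, 198), (92, 100), (91, 102)]
healthy = [(65, 150), (70, 210), (70, 205)]

print(looks_throttled(misbehaving), looks_throttled(healthy))
# prints: True False
```

Logging this per bus ID over a training run would show whether the flag follows the slot rather than the card, which is what the swap tests above suggest.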
gpu_burn does not report any issue with the cards.
NVIDIA driver version is 410.79.