[EDIT: Sorry, I realized too late that this section of the forum is for GPU cloud users; could someone move my post to the right place? Thanks!]
Dear all,
My NVIDIA Titan V freezes at random during CUDA computations (deep learning with PyTorch 1.0.1). Sometimes a run lasts 30 minutes, sometimes it locks up after just two. The symptoms are always the same:
- The computation comes to a halt (no further console output)
- nvidia-smi shows 100% GPU load, but low temperature and power draw (50C and 39W / 250W in the output below), so the GPU is actually idling
- The system itself does not lock up; I can easily restart the process, which then runs for another few minutes before freezing again
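The second symptom (full utilization at near-idle power draw) can be checked automatically. The following is a minimal sketch, not an official tool: it polls nvidia-smi's CSV query interface and flags a GPU that reports high utilization while drawing only a small fraction of its power limit. The function names and both thresholds (95% utilization, 25% of the power limit) are assumptions chosen for illustration and would need tuning.

```python
import subprocess
import time

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY = "index,utilization.gpu,power.draw,power.limit,temperature.gpu"


def _num(field):
    """Convert a CSV field to float; return None for values like '[N/A]'."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse output of --query-gpu=... --format=csv,noheader,nounits."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, draw, limit, temp = [f.strip() for f in line.split(",")]
        rows.append({"index": int(idx), "util": _num(util),
                     "power_draw": _num(draw), "power_limit": _num(limit),
                     "temp": _num(temp)})
    return rows


def looks_stalled(row, util_min=95.0, power_frac_max=0.25):
    """Heuristic (thresholds are assumptions): busy on paper, idle in watts."""
    if None in (row["util"], row["power_draw"], row["power_limit"]):
        return False
    return (row["util"] >= util_min
            and row["power_draw"] <= power_frac_max * row["power_limit"])


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY,
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return parse_gpu_csv(out)


def watch(interval_s=10):
    """Poll forever and print a timestamped warning for suspicious GPUs."""
    while True:
        for row in query_gpus():
            if looks_stalled(row):
                print(time.strftime("%H:%M:%S"), "GPU", row["index"],
                      "looks stalled:", row)
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal while training runs would give a rough timestamp for when the lockup begins. A GPU that is legitimately busy with a very light workload could also trip this check, so a hit is a hint, not proof.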
Hardware:
- Motherboard: ASUS P9D WS with BIOS 2202
- CPU: Intel Xeon E3-1225 v3
- GPUs: PCIe slot 1 (x16): NVIDIA Titan V; slot 3 (x8): MSI GeForce GTX 980 Ti (used for CUDA and display)
- PSU: Corsair RM750 (750W)
Software:
- OS: Ubuntu 14.04 Trusty
- NVIDIA Driver: 410.48
- CUDA Toolkit: 10.0.130 (previous versions are also installed, but not used anymore)
nvidia-smi output after lockup:
xxx@XXX:~$ nvidia-smi
Mon Feb 18 09:44:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             On   | 00000000:01:00.0 Off |                  N/A |
| 34%   50C    P2    39W / 250W |   1935MiB / 12034MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti  On   | 00000000:03:00.0  On |                  N/A |
| 63%   74C    P2   229W / 250W |   5564MiB /  6083MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9712      C   ...xxxxxx/anaconda2/envs/p36/bin/python3.6  1923MiB |
|    1      1151      G   /usr/lib/xorg/Xorg                           640MiB |
|    1      2502      G   compiz                                       142MiB |
|    1      2664      G   ...-token=EDC037A6A283AB467CFDD4888DC69C9E    17MiB |
|    1      2750      G   ...-token=9EA2947C5ADA1E30BC62EC4CE1CF063B    78MiB |
|    1      3457      G   /usr/lib/firefox/firefox                       3MiB |
|    1      3700      G   /usr/lib/firefox/firefox                       3MiB |
|    1      3717      G   /usr/lib/firefox/firefox                       3MiB |
|    1      9476      C   ...xxxxxx/anaconda2/envs/p36/bin/python3.6  4662MiB |
|    1     21160      G   ...opt/mendeleydesktop/bin/mendeleydesktop     3MiB |
+-----------------------------------------------------------------------------+
Could it be the OS (I’ll update to 18.04 soon anyway), the BIOS, or a faulty card?
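One thing that might help narrow this down is the kernel log: the NVIDIA driver reports GPU errors there as "Xid" messages prefixed with `NVRM: Xid`. The sketch below is a rough helper, not a definitive diagnostic; it scans log text (for example the output of `dmesg`, or `/var/log/kern.log` on Ubuntu) and collects any such lines. The function name and the exact regular expression are my own assumptions about the message layout, so an empty result does not prove the log is clean.

```python
import re

# Assumed layout: "NVRM: Xid (<device>): <code>, <details>".
# The device part is captured loosely because its format varies by driver.
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_events(log_text):
    """Return (device, xid_code, full_line) for each Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events


def xid_counts(events):
    """Count occurrences per (device, xid_code) pair."""
    counts = {}
    for device, code, _line in events:
        counts[(device, code)] = counts.get((device, code), 0) + 1
    return counts
```

For example, `find_xid_events(open("/var/log/kern.log").read())` run right after a lockup would show whether the driver logged anything for the Titan V at bus ID 01:00.0. Xid messages around the time of the freeze would be worth including in a bug report; if there are none, that is at least a data point when weighing the OS, BIOS, and hardware explanations.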
Thank you very much for your answers!