Titan V freezes on Ubuntu 14.04

[EDIT: sorry, I realized too late that this part of the forum is for GPU cloud users; could someone move my post to the right location? Thanks!]

Dear all,

My NVIDIA Titan V freezes at random during CUDA computations (deep learning with PyTorch 1.0.1). Sometimes it runs for 30 minutes, sometimes it locks up after just two. The symptoms are always the same:

  • Computation comes to a halt (no more console output)
  • nvidia-smi shows 100% GPU utilization, but low temperature and power draw (so the GPU is actually idling)
  • The system itself does not lock up; I can restart the process without trouble, and it then runs for another few minutes
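
To catch this state automatically instead of watching nvidia-smi by hand, something like the following might work. It is only a minimal sketch: the function names and the thresholds (at least 95% utilization while drawing at most 60 W) are my own assumptions and would need tuning per card. It relies on nvidia-smi's `--query-gpu` CSV output.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,utilization.gpu,power.draw,temperature.gpu"


def parse_gpu_stats(csv_text):
    """Parse 'index, util, power, temp' CSV rows into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            stats.append({
                "index": int(fields[0]),
                "util": float(fields[1]),   # percent
                "power": float(fields[2]),  # watts
                "temp": float(fields[3]),   # degrees Celsius
            })
        except ValueError:
            # Skip rows with non-numeric values (e.g. unsupported fields).
            continue
    return stats


def looks_stalled(gpu, min_util=95.0, max_power_w=60.0):
    """Heuristic (assumed thresholds): full utilization at near-idle power."""
    return gpu["util"] >= min_util and gpu["power"] <= max_power_w


def read_gpu_stats():
    """Query all GPUs once via nvidia-smi and return the parsed stats."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_stats(out)
```

Calling `read_gpu_stats()` every few seconds and logging a timestamp whenever `looks_stalled` returns true for GPU 0 would at least show exactly when each lockup begins.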

Hardware:

  • Motherboard: ASUS P9D WS with BIOS 2202
  • CPU: Intel Xeon E3-1225 v3
  • GPUs: PCIe slot 1 (x16): NVIDIA Titan V; slot 3 (x8): MSI GeForce GTX 980 Ti (used for CUDA and display)
  • PSU: Corsair RM750 (750W)

Software:

  • OS: Ubuntu 14.04 Trusty
  • NVIDIA Driver: 410.48
  • CUDA Toolkit: 10.0.130 (previous versions are also installed, but not used anymore)

nvidia-smi output after lockup:

xxx@XXX:~$ nvidia-smi
Mon Feb 18 09:44:25 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             On   | 00000000:01:00.0 Off |                  N/A |
| 34%   50C    P2    39W / 250W |   1935MiB / 12034MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti  On   | 00000000:03:00.0  On |                  N/A |
| 63%   74C    P2   229W / 250W |   5564MiB /  6083MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9712      C   ...xxxxxx/anaconda2/envs/p36/bin/python3.6  1923MiB |
|    1      1151      G   /usr/lib/xorg/Xorg                           640MiB |
|    1      2502      G   compiz                                       142MiB |
|    1      2664      G   ...-token=EDC037A6A283AB467CFDD4888DC69C9E    17MiB |
|    1      2750      G   ...-token=9EA2947C5ADA1E30BC62EC4CE1CF063B    78MiB |
|    1      3457      G   /usr/lib/firefox/firefox                       3MiB |
|    1      3700      G   /usr/lib/firefox/firefox                       3MiB |
|    1      3717      G   /usr/lib/firefox/firefox                       3MiB |
|    1      9476      C   ...xxxxxx/anaconda2/envs/p36/bin/python3.6  4662MiB |
|    1     21160      G   ...opt/mendeleydesktop/bin/mendeleydesktop     3MiB |
+-----------------------------------------------------------------------------+

Could it be the OS (I’ll update to 18.04 soon anyway), the BIOS, or a faulty card?
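
One thing that might help narrow this down: the NVIDIA kernel driver reports GPU faults in the kernel log as lines containing `NVRM: Xid`. Below is a minimal sketch for pulling those lines out after a lockup. The helper names are hypothetical, and reading the log via `dmesg` is an assumption (it may need root on some systems).

```python
import subprocess


def find_xid_lines(log_text):
    """Return the kernel-log lines that mention an NVIDIA Xid event."""
    return [line for line in log_text.splitlines() if "NVRM: Xid" in line]


def read_kernel_log():
    """Read the kernel ring buffer via dmesg (may require root)."""
    return subprocess.check_output(["dmesg"], universal_newlines=True)
```

Running `find_xid_lines(read_kernel_log())` right after a freeze and finding nothing would not rule out a hardware fault, but any Xid line that does show up is worth including in a report.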

Thank you very much for your answers!