Hi There
I had an issue reported on a server which run’s several nvidia GPU’s, it went offline after a Kernel panic.
Prior to the Kernel panic I found the following sequence of messages repeated many times. This can be resolved by amending the grub loader, however I’m unsure whether this will just cover up the issue.
[12245249.690604] pcieport 0000:80:02.0: AER: Corrected error received: id=8010
[12245249.690614] pcieport 0000:80:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=8010(Transmitter ID)
[12245249.692700] pcieport 0000:80:02.0: device [8086:6f04] error status/mask=00001000/00002000
[12245249.694642] pcieport 0000:80:02.0: [12] Replay Timer Timeout
On top of this, I found that the nvidia-smi command reports no processes running while also saying one of the cards is being utilised at 70 - 95 %. There was some advice on another thread which said it maybe worth while reinstalling
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 45C P0 59W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 36C P0 73W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K40c Off | 0000:81:00.0 Off | 0 |
| 23% 33C P0 64W / 235W | 23MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 42C P0 61W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 34C P0 75W / 149W | 55MiB / 11519MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Further more, when ever I run nvidia-smi, I get the following error reported in the syslog.
Feb 6 11:20:20 sci-gpu nvidia-persistenced: Started (6533)
Feb 6 11:20:34 sci-gpu nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Feb 6 11:20:34 sci-gpu nvidia-persistenced: Shutdown (6533)
Running with
Ubuntu 14.04 / Cuda compilation tools, release 7.5, V7.5.17 / NVIDIA binary driver - version 352.99