Kernel panic and confusion around nvidia-smi

Hi there,
I had an issue reported on a server which runs several NVIDIA GPUs: it went offline after a kernel panic.

Prior to the kernel panic I found the following sequence of messages repeated many times. The messages can be resolved by amending the GRUB boot loader configuration; however, I’m unsure whether this will just cover up the issue.

[12245249.690604] pcieport 0000:80:02.0: AER: Corrected error received: id=8010
[12245249.690614] pcieport 0000:80:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=8010(Transmitter ID)
[12245249.692700] pcieport 0000:80:02.0: device [8086:6f04] error status/mask=00001000/00002000
[12245249.694642] pcieport 0000:80:02.0: [12] Replay Timer Timeout
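To judge how often these corrected errors occur and whether they all come from the same port, the log can be tallied per device. The following is only a rough sketch: the regular expression is written against the message format shown above, and the function name `count_aer_events` is made up for illustration.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line of each AER report, as it
# appears in the kernel log excerpt above.
AER_LINE = re.compile(
    r"pcieport (?P<dev>[0-9a-f:.]+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def count_aer_events(log_text):
    """Return a Counter keyed by (device, severity, layer)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            counts[(match["dev"], match["severity"], match["layer"])] += 1
    return counts


# Sample input copied from the log excerpt above.
sample = """\
[12245249.690604] pcieport 0000:80:02.0: AER: Corrected error received: id=8010
[12245249.690614] pcieport 0000:80:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=8010(Transmitter ID)
[12245249.692700] pcieport 0000:80:02.0: device [8086:6f04] error status/mask=00001000/00002000
[12245249.694642] pcieport 0000:80:02.0: [12] Replay Timer Timeout
"""

for key, count in count_aer_events(sample).items():
    print(key, count)
```

In practice the input would be the saved kernel log rather than the four-line sample.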

On top of this, I found that the nvidia-smi command reports no running processes while also showing one of the cards at 70-95% utilisation. There was some advice on another thread which suggested that reinstalling may be worthwhile.

+------------------------------------------------------+                       
| NVIDIA-SMI 352.99     Driver Version: 352.99         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
| N/A   45C    P0    59W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   36C    P0    73W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 0000:81:00.0     Off |                    0 |
| 23%   33C    P0    64W / 235W |     23MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   42C    P0    61W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   34C    P0    75W / 149W |     55MiB / 11519MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
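To catch this mismatch automatically, the utilisation figures can be read in CSV form via `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits` and checked for busy GPUs. This is a minimal sketch: the sample text mirrors the table above rather than real command output, and the helper names and the 10% threshold are arbitrary choices for illustration.

```python
import csv
import io


def parse_gpu_utilization(csv_text):
    """Map GPU index -> utilization percent from 'index, utilization' rows."""
    utilization = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        index, util = row
        utilization[int(index)] = int(util)
    return utilization


def busy_gpus(utilization, threshold=10):
    """Return sorted indices of GPUs at or above the threshold (percent)."""
    return sorted(i for i, u in utilization.items() if u >= threshold)


# Values taken from the nvidia-smi table above (GPU index, GPU-Util).
sample = "0, 0\n1, 0\n2, 0\n3, 0\n4, 89\n"

print(busy_gpus(parse_gpu_utilization(sample)))
```

Any GPU this flags while the process table is empty is worth cross-checking by other means, for example by looking at which processes hold the `/dev/nvidia*` device files open.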

Furthermore, whenever I run nvidia-smi, I get the following error reported in the syslog.

Feb  6 11:20:20 sci-gpu nvidia-persistenced: Started (6533)
Feb  6 11:20:34 sci-gpu nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Feb  6 11:20:34 sci-gpu nvidia-persistenced: Shutdown (6533)

Running with
Ubuntu 14.04 / Cuda compilation tools, release 7.5, V7.5.17 / NVIDIA binary driver - version 352.99

Try using gpu-burn to test the stability of the GPUs. Might want to do the test 1 by 1, then try more than 1 at a time and see what happens. If it is some hardware issue, I can almost guarantee that program will let you reproduce the issue and figure out if it’s a faulty slot/faulty GPU or if kernel panic is unrelated to NVIDIA cards altogether.
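The one-GPU-at-a-time run could be scripted roughly as below. This is a sketch under assumptions: it presumes the gpu-burn binary is at `./gpu_burn` and takes the test duration in seconds as its only argument, so adjust both to your build. `CUDA_VISIBLE_DEVICES` restricts each run to a single device, and `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set so that the indices line up with nvidia-smi's ordering. The helper names are made up.

```python
import os
import subprocess


def burn_plan(gpu_indices, seconds=60, binary="./gpu_burn"):
    """Build one (env overrides, argv) pair per GPU; nothing is executed here.

    The binary path and its single 'seconds' argument are assumptions.
    """
    plan = []
    for index in gpu_indices:
        env_overrides = {
            "CUDA_DEVICE_ORDER": "PCI_BUS_ID",   # match nvidia-smi numbering
            "CUDA_VISIBLE_DEVICES": str(index),  # expose only this GPU
        }
        plan.append((env_overrides, [binary, str(seconds)]))
    return plan


def run_plan(plan):
    """Run each planned test in turn and report its exit code."""
    for env_overrides, argv in plan:
        env = dict(os.environ, **env_overrides)
        result = subprocess.run(argv, env=env)
        print("GPU", env_overrides["CUDA_VISIBLE_DEVICES"],
              "exit code", result.returncode)


plan = burn_plan(range(5))
# run_plan(plan)  # uncomment on the GPU host to actually run the tests
```

Once every card passes on its own, the same idea extends to pairs or to all cards at once by listing several indices in `CUDA_VISIBLE_DEVICES`.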

If you know what the kernel panic details are that would be helpful, and if it’s nvidia-related, send a bug-report. Would recommend updating BIOS/UEFI in that system as well if not done already.

The only thing I can directly comment is the pcieport messages – happened on my X99 board, and were harmless, but I had another underlying problem – see here:

http://www.overclock.net/t/1539708/question-for-x99-board-owners-with-nvidia-cards-do-you-see-pcie-bus-errors-please-respond-to-poll/40

In re: nvidia-persistenced:

https://devtalk.nvidia.com/default/topic/934350/nvidia-persistenced-failed-to-query-nvidia-devices-/

It seems that particular issue is harmless.

Is this a server acquired from an NVIDIA-blessed integrator, or is this a self-built system? The mix of two K80 modules (passively cooled) and a K40c (actively cooled) seems to suggest the latter. A plethora of issues is observed when using K80s in systems that weren’t put together by an official integrator, as can be seen from the fair number of questions in these forums.

In addition to potential system BIOS issues (make sure you use the latest available), make sure there is sufficient cooling and power supply for the K80s. A single 1600 W PSU (I would suggest an 80 Plus Platinum rated PSU for robustness and efficiency) seems just about adequate. The nominal wattage of all system components combined should normally not be higher than 60% of the nominal wattage of the PSU.
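As a rough worked example of that 60% rule, using only the per-GPU power caps from the nvidia-smi output earlier in the thread (CPUs, drives and fans are not counted, since their wattage is not given here); the function name is invented for illustration:

```python
def remaining_budget(component_watts, psu_watts, max_load_percent=60):
    """Return (total draw, allowed budget, watts left) under the load rule."""
    total = sum(component_watts)
    budget = psu_watts * max_load_percent / 100
    return total, budget, budget - total


# Power caps reported by nvidia-smi: four K80 GPUs at 149 W, one K40c at 235 W.
gpu_caps = [149, 149, 235, 149, 149]

for psu in (1600, 2000):
    total, budget, left = remaining_budget(gpu_caps, psu)
    print(f"{psu} W PSU: GPUs {total} W, budget {budget:.0f} W, "
          f"{left:.0f} W left for the rest of the system")
```

With the GPUs alone at 831 W, a 1600 W supply leaves 129 W of the 60% budget for everything else, and a 2000 W supply leaves 369 W.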

Make sure all GPUs are securely mounted, to avoid mechanical stress on the PCIe connectors and to prevent mechanical vibrations from fans or HDs negatively impacting the integrity of the PCIe signalling.

Good news everyone, I ran gpuBurn and found no issues from this test.

Tested 5 GPUs:
	GPU 0: OK
	GPU 1: OK
	GPU 2: OK
	GPU 3: OK
	GPU 4: OK

The kernel panic was particularly severe, as it just printed “^@^@^@^@^@^@^@…” when the system stopped. This could be down to just some badly compiled binary; I will have to look further into this.

@vacaloca thanks for the links, I will definitely follow up on the X99 link.

@njuffa I’m unsure whether the supplier was an NVIDIA-blessed supplier, but it is a well-built rack system with plenty of cooling and 2 kW redundant power.

Thanks both for your advice!

Not sure what you mean by “badly compiled binary”. A user-land program, no matter how poorly behaved, should not be able to trigger a kernel panic. If you mean the OS itself could be flawed, that could be a possibility, not sure how likely.

My original guess was that some sort of transient hardware problem had occurred, quite possibly nothing directly to do with the GPUs in the system, but possibly a brown-out under heavy usage. However, with a 2 kW PSU, that seems very unlikely.

Sites dedicated to Ubuntu, like askubuntu.com, may be able to provide more targeted help as to how to get to the bottom of a kernel panic in Ubuntu.