Deciphering an NVRM: Xid message?

eelsen · May 27, 2011, 10:08pm

Sorry to revive such an old thread, but I seem to be running into issues related to the ones encountered here almost 3 years ago. Except the NVRM message I’m getting is:

nvidia kernel: NVRM: Xid (0000:03:00): 31, Ch 00000001, engmask 00000101, intr 10000000

Can anyone enlighten me as to what error code 31 means?

tmurray · May 27, 2011, 10:24pm

MMU fault, so you accessed a bad pointer.
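For anyone scanning dmesg output for these lines, here is a minimal sketch of pulling the PCI address and Xid number out of a log line. The names `XidEvent`, `parse_xid_line`, and `describe_xid` are made up for illustration, and the lookup only knows the one code explained in this thread; it is not an official NVIDIA table. The fields after the Xid number are left undecoded because their meaning is not given here.

```cpp
#include <regex>
#include <string>

// Fields pulled out of one "NVRM: Xid" kernel log line (hypothetical struct).
struct XidEvent {
    std::string pci_address;  // e.g. "0000:03:00"
    int code;                 // e.g. 31
    std::string details;      // rest of the line, kept verbatim and undecoded
};

// Fills `out` and returns true if `line` contains an Xid report;
// returns false and leaves `out` untouched otherwise.
bool parse_xid_line(const std::string& line, XidEvent& out) {
    static const std::regex pattern(
        R"(NVRM: Xid \(([0-9A-Fa-f:.]+)\): (\d+),\s*(.*))");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return false;
    }
    out.pci_address = match[1].str();
    out.code = std::stoi(match[2].str());
    out.details = match[3].str();
    return true;
}

// Assumption: this table contains only what this thread states (Xid 31).
std::string describe_xid(int code) {
    switch (code) {
        case 31:
            return "MMU fault (e.g. access through a bad pointer)";
        default:
            return "unknown (not covered in this thread)";
    }
}
```

Feeding it the line from the post above should give the address 0000:03:00 and code 31; lines that are not Xid reports, such as the os_schedule message, make `parse_xid_line` return false.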

ceearem · June 30, 2011, 6:10pm

Hi

I started to get the same error now, after switching from CentOS 5.5 to 5.6 on a machine with two GTX 470s and a GT220 for screen output.

NVRM: Xid (0000:03:00): 31, Ch 00000001, engmask 00000101, intr 10000000

NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

After that error a simple program like this:

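#include <cstdio>
#include <cuda_runtime.h>  // CUDA runtime API: cudaSetDevice, cudaMalloc

int main()
{
    // Select the GPU to test (device 0 here).
    cudaError_t error = cudaSetDevice(0);
    printf("Error A: %i\n", error);

    // Small allocation; this is the call that never returns.
    void* pointer;
    error = cudaMalloc(&pointer, 1000);
    printf("Error B: %i\n", error);

    return 0;
}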

hangs forever after printing “Error A: 0” if started on the GPU that gave the error (this can be both GPUs if both gave that error).

It does not depend on whether we use “thread exclusive” or “default” mode.

The error occurred with both the 270.xx.xx and the 275.xx.xx drivers, using CUDA 4.0. We also tested Scientific Linux 6.0 with the same result.

It seems that the error does not occur if no X-Server is running or the X-Server uses the VESA module instead of the nvidia driver.

We have also run the code that triggers the error in the first place under cuda-memcheck, and it reported no errors.

Does anyone have any suggestions on how to proceed?

Cheers

Ceearem

ceearem · June 30, 2011, 8:53pm

I just happened to be logged in interactively on a machine when the GPU got stuck.
The machine had just been rebooted; another job was running on the first GPU, and on the second GPU the first job crashed.
There was no crash-related output from the job itself.

dmesg:
NVRM: Xid (0000:04:00): 13, 0001 00000000 000090c0 00002388 20129300 00000000

lspci -k | grep VGA
03:00.0 VGA compatible controller: nVidia Corporation GF100 [GeForce GTX 470] (rev a3)
04:00.0 VGA compatible controller: nVidia Corporation GF100 [GeForce GTX 470] (rev a3)
07:00.0 VGA compatible controller: nVidia Corporation G84 [GeForce 8600 GT] (rev a1)

nvidia-smi -a | grep Driver
Driver Version : 275.09.07

nvidia-smi -s
COMPUTE mode rules for GPU 0: 1
COMPUTE mode rules for GPU 1: 1
COMPUTE mode rules for GPU 2: 2

uname -a
HOSTNAME 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

After booting I took the system into “init 3”.
So there was no X-Server running.

Ah, before I forget: the crashes don’t seem to be predictable. Sometimes the machine works for 10 runs or more, and this time the second job started produced the problem.

Cheers
Ceearem

ross123456789 · October 17, 2011, 7:54pm

My GTX 580 is throwing the following error, which locks up the card.

kernel: NVRM: Xid (0000:02:00): 44, 0000 00000000 0000 0000 00000000 00000000 00000000

Does anyone know what Xid error 44 is?

Any help is appreciated.

ivanwick_tvec · January 2, 2012, 8:39pm

Is a table of Xid error codes and their descriptions published anywhere?

Above, you mention that Xid 31 means an MMU fault, probably due to a bad pointer dereference. However, can it also be caused by something else, e.g. an MMU fault occurring for some other reason during a full-duplex memory transfer between the device and host?

ross123456789 · January 8, 2012, 7:05pm

Yes, I am also looking for a table of NVRM Xid error codes from NVIDIA.

Normally, I see Xid 13 whenever a CUDA program reports a “cuda launch failure”. The only way to recover is to cold-reboot the machine.

elsifaka · April 1, 2012, 9:14am

bump,

Same error here. I’ve got a GTX 460M with Linux 3.2.13-1-ARCH and the nvidia 295.33-1 driver (I’ve never gotten this GPU working on Linux).

Is it a hardware problem?