cuda-gdb Error: Failed to suspend device (dev=0, error=10).

Hello everyone,

I would like some high-level suggestions/hypotheses about an odd problem I’m experiencing.

I have a program that is essentially a tree exploration, driven by a recursive call
on the host side; at each call, a blocking kernel is launched.
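For context, the launch pattern looks roughly like this (a minimal sketch with hypothetical names and grid sizes, not my actual code; the kernel body and the child iteration are placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void exploreKernel(int depth)
{
    // placeholder for the per-node work
}

// Hypothetical sketch: each host-side recursive call launches a
// kernel and blocks on its completion before recursing deeper.
void explore(int depth, int numChildren)
{
    exploreKernel<<<157, 256>>>(depth);          // multi-block grid
    cudaError_t err = cudaThreadSynchronize();   // blocking, CUDA 4.x style
    if (err != cudaSuccess) {
        fprintf(stderr, "depth %d: %s\n", depth, cudaGetErrorString(err));
        return;
    }
    for (int c = 0; c < numChildren; ++c)
        explore(depth + 1, /* children of node c */ 0);
}
```

Over a full run this results in roughly 10K such launches before the failure appears.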

The problem is that the execution aborts during a kernel call with
the generic "unspecified launch failure" message.

Unfortunately the error is not deterministically reproducible: on 2 specific machines it happens
consistently (after roughly 10K kernel calls), but each time at a different point in the run.
The only case in which the problem is absent is when the kernels are launched with a single block.
The other machines I tested show no errors at all, with every grid configuration.

cuda-gdb reports a strange error code (10). I did not find any documentation about it,
but I believe it is related more to the OS/HW than to the program itself.
The code is rather involved, but it has been tested on different systems.
The partial executions are identical up to the kernel crash, so I don’t suspect a bug in the code,
and cuda-memcheck reports no errors.

On every machine kernelExecTimeoutEnabled reads 0, and I compiled with -arch=sm_21.
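For reference, this is how I read that flag (a minimal sketch; device 0 is assumed, and in real code the return value of the API call should also be checked):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the properties of device 0 and print the watchdog flag.
    // kernelExecTimeoutEnabled == 0 means no run-time limit on kernels.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: kernelExecTimeoutEnabled = %d\n",
           prop.name, prop.kernelExecTimeoutEnabled);
    return 0;
}
```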

The following 2 configurations cause the problem:
Red Hat Enterprise Linux Workstation release 6.2 (Santiago), 12-core Xeon E5645 at 2.4GHz, 32GB RAM
Tesla C2075,
nvcc release 4.1, V0.2.1221

openSUSE 11.3 (x86_64), Host: 4-core Xeon E5405 at 2GHz, 2GB RAM
Quadro 4000
nvcc release 4.0, V0.2.1221

While every other system is ok (here an example):

Mandriva Linux release 2011.0 (Official) for x86_64, AMD Opteron 270, 2.01GHz, 4GB RAM
GTS 450
nvcc release 4.0, V0.2.1221

Do you have any high-level suggestions?
I’m probably missing some setup/configuration issue related to the OS/cards.

Do you have any details about the error message I get?

Thank you in advance for your comments,
Alessandro

Hello, I’m getting exactly the same error.
Memcheck says everything is ok, but cuda-gdb just can’t finish running my program.

[Launch of CUDA Kernel 15 (migration_A2A_Kernelc<<<(157,1,1),(256,1,1)>>>) on Device 0]
Error: Failed to suspend device (dev=0, error=10).

System:
CentOS 6.2, 2.6.32-220.2.1.el6.x86_64, Tesla C2075

SDK:
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2011 NVIDIA Corporation
Built on Thu_Nov_17_17:38:12_PST_2011
Cuda compilation tools, release 4.1, V0.2.1221

Could it be something related to the Linux kernel only? I can’t test it on a Windows machine…

Any ideas what could be wrong?

M.

I got the same problem, and I’m posting here to bring this topic to the top of the list. Hopefully someone can give an answer.

I am not sure about this, but here is an idea: it could be caused by staying too long
in the kernel (possible when there is a lot of computation to do).
Indeed, I had this problem, and I fixed it by changing a parameter that shortens a loop in my kernel.
If I run my program without cuda-gdb, it looks like an infinite loop, with no messages at all.

What do you think of this idea?

I also rebooted the computer; so far I think that is what fixed my (random) problem, but I don’t know why.

In my case, I do have a while loop, but under normal conditions it would iterate only a few times.

It looks like exactly one of the executing blocks stops responding
(tried with printf inside the kernel).

Also, sometimes the problem arises before the while loop, between
kernel instantiation and execution (meanwhile the other blocks
terminate correctly).
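The printf check was along these lines (a hypothetical sketch of the diagnostic, not the real kernel; device-side printf needs sm_20 or later, which the sm_21 build satisfies):

```cuda
#include <cstdio>

// Diagnostic sketch: thread 0 of each block reports entry and exit,
// so a block that prints "enter" but never "done" is the hung one.
__global__ void tracedKernel()
{
    if (threadIdx.x == 0)
        printf("block %d: enter\n", blockIdx.x);

    // ... the while loop under suspicion ...

    if (threadIdx.x == 0)
        printf("block %d: done\n", blockIdx.x);
}
```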

I don’t have any kernel timeout set, so I lean towards a different hypothesis
than a simple infinite loop.

Moreover, the fact that sometimes I’m able to run the same kernel and
sometimes not (after thousands of previous calls)
makes me think of a different problem. Some hypotheses I’m trying (but I’m really puzzled):

  • many kernels are launched and gdb does not acknowledge their termination.
I see that this is rather normal (especially if you use cudaThreadSynchronize
or similar), but it may be a symptom in conjunction with some specific
HW-related problem

  • there is a stack related to GPU kernel launches that is underdimensioned

  • the issue is not strictly related to the kernel thread activity/program
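On the stack hypothesis: if it is the per-thread device stack, that limit can be queried and raised explicitly before the first launch (a sketch only; the default and the appropriate value are device-dependent, and 4 KB here is an arbitrary example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Read the current per-thread device stack size.
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default per-thread stack: %zu bytes\n", stackSize);

    // Raise it (e.g. to 4 KB per thread) before any kernel launches,
    // in case deep call chains in the kernel overflow the default.
    cudaDeviceSetLimit(cudaLimitStackSize, 4096);
    return 0;
}
```

If the crash disappears with a larger stack, that would point at a device-side stack overflow rather than a HW/OS issue.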