Continue after error: cutilSafeCall aborts program on error

Is it possible for a Linux program to recover from
cudaSafeCall() Runtime API error in file <gpu.cpp>, line 58 : unspecified launch failure.
and continue?
E.g., after one kernel has failed, I would like to try another one.
Thank you


PS: This particular error is reported by cutilSafeCall( cudaMemcpy(…cudaMemcpyDeviceToHost));
which comes after Kernel<<<grid_size, block_size>>>(…)

    Dr. W. B. Langdon,
    Department of Computer Science,
    King's College London,
    Strand, London, WC2R 2LS, UK


cutil is a library only intended for use by SDK apps to make them easier to follow and should not be used by anyone else. It is specifically not part of CUDA.

Dear Tim,

Thank you for your reply.

Can you recommend a way to report errors detected by a CUDA kernel?

Also, if a kernel fails, is there a good way to start another one?



I am slightly puzzled that you said cutil should not be used by anyone else.

If I look in ReleaseNotes.html, it appears to be advocating the re-use of SDK code:


The CUDA Developer SDK provides examples with source code, utilities, and white papers to help you get started writing software with CUDA. The SDK includes dozens of code samples covering a wide range of applications …

and the CUDA 2.3 template in /opt/cuda/sdk/C/src/template/ uses cutilCheckMsg, cutilSafeCall, and other cutil routines.

If you are using the runtime API, then cudaGetLastError() and cudaGetErrorString() should do most of what you want most of the time. However, for kernel errors you might need to check the return status of cudaThreadSynchronize() instead, because kernel launches are asynchronous and some classes of kernel error are only reported after the launch has returned. Synchronizing after every launch can hurt overall performance if you are trying to overlap host computation with GPU computation.
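For what it's worth, here is a minimal sketch of that kind of check, written so that it reports the error and returns to the caller instead of aborting. The kernel `scale` and the helper `launch_ok` are made-up names for illustration only. I have used cudaDeviceSynchronize(), which in later toolkits replaces the deprecated cudaThreadSynchronize(); on an older toolkit, substitute the latter. Whether the context is still usable after a failure depends on the error, so treat this as error reporting rather than guaranteed recovery.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to launch.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns true if both the launch and the kernel's
// execution succeeded. On failure it prints the error and returns false
// rather than calling exit(), so the caller decides what happens next.
static bool launch_ok(const char *what)
{
    // Errors detected at launch time (e.g. an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Errors that only surface once the kernel has actually run.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;

    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    if (!launch_ok("scale")) {
        // Control returns here instead of the program aborting.
        // Whether another kernel can be tried depends on the error.
    }

    cudaFree(d_data);
    return 0;
}
```

The synchronize call is what makes the check reliable, and it is also what costs the overlap between host and GPU work, so you may only want it in debug builds or at points where you have to wait anyway.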

As to recovering from kernel errors, the runtime API isn’t fault tolerant under all circumstances, and sometimes contexts are closed when errors occur, in which case program execution cannot continue. You would probably be better off adding some predictive code before the kernel launch so that kernel failures are avoided in the first place. In some of our codes, we have added simple execution time, complexity, and memory footprint models which are called prior to launching a kernel. They compute the execution parameters for the kernel launch and the arguments for the kernel function, and they flag when resource exhaustion or a watchdog timeout would occur so that the caller can handle it.
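To give an idea of what such a pre-launch model might look like, here is a rough host-side sketch. It is not our actual code: the struct and function names, the linear cost model, and every limit are assumptions made up for illustration, and in a real program the limits would come from the device rather than being passed in by hand.

```cpp
#include <cstddef>

// Illustrative limits supplied by the caller. Nothing here is queried
// from a real device. max_threads_per_block is assumed to be non-zero.
struct DeviceLimits {
    std::size_t free_memory_bytes;      // memory available for kernel buffers
    unsigned    max_threads_per_block;
    unsigned    max_blocks;             // largest 1-D grid allowed
    double      watchdog_seconds;       // longest run the watchdog tolerates
};

enum PlanStatus {
    PLAN_OK,
    PLAN_OUT_OF_MEMORY,
    PLAN_GRID_TOO_LARGE,
    PLAN_WOULD_TIME_OUT
};

struct LaunchPlan {
    PlanStatus status;
    unsigned   block_size;
    unsigned   grid_size;
};

// Hypothetical model: memory grows linearly with n, and run time is
// estimated as n * seconds_per_element (a number you would measure).
LaunchPlan plan_launch(std::size_t n, std::size_t bytes_per_element,
                       double seconds_per_element, const DeviceLimits &lim)
{
    LaunchPlan plan = { PLAN_OK, lim.max_threads_per_block, 0 };

    // Memory footprint model. Dividing instead of multiplying avoids
    // overflow in n * bytes_per_element.
    if (bytes_per_element != 0 &&
        n > lim.free_memory_bytes / bytes_per_element) {
        plan.status = PLAN_OUT_OF_MEMORY;
        return plan;
    }

    // Launch configuration: one thread per element, rounded up to
    // a whole number of blocks.
    std::size_t blocks = (n + plan.block_size - 1) / plan.block_size;
    if (blocks > lim.max_blocks) {
        plan.status = PLAN_GRID_TOO_LARGE;
        return plan;
    }
    plan.grid_size = static_cast<unsigned>(blocks);

    // Execution time model: refuse launches predicted to hit the watchdog.
    if (static_cast<double>(n) * seconds_per_element > lim.watchdog_seconds) {
        plan.status = PLAN_WOULD_TIME_OUT;
    }
    return plan;
}
```

The caller only launches when the status is PLAN_OK, using the returned block_size and grid_size. For any other status it can split the work into smaller pieces or fall back to a different kernel, which is the kind of handling the original question asks for.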