Random Failures with RTX Cards in TCC Mode for Calculation

We are getting multiple errors when attempting to run calculations across two GPUs in our software. The GPUs are of various models (RTX 5000s, RTX A2000s, P4000s), and the errors occur randomly across multiple PCs.

ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory.
The context cannot be used, so it must be destroyed (and a new one should be created).

All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA. ---> ManagedCuda.CudaException: ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory.
The context cannot be used, so it must be destroyed (and a new one should be created).

There is a bug in your code. The proximate cause of the failure is an out-of-bounds access in device code, but the root cause could be in either host or device code. I would suggest:

  1. Double-check that the return status of every CUDA API call and every kernel launch is checked. For example, a failed allocation could leave the code operating on an invalid pointer, or a failed copy could leave device code operating on uninitialized data. A sketch of this kind of error checking follows the list.

  2. Run the code under control of compute-sanitizer and address any issues it identifies. Device code may operate on an invalid pointer, process uninitialized data, make an out-of-bounds access prior to the failing one (e.g. one that exceeds the bounds of the intended data object but lands in a neighboring, properly allocated object), or contain a race condition that gives rise to one of the other conditions. A second sketch below shows the kind of latent out-of-bounds bug compute-sanitizer catches even when the program appears to run.
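Here is a minimal sketch of the error checking meant in point 1, written against the CUDA runtime API. The CUDA_CHECK macro and the scale kernel are illustrative placeholders, not something from your application; if you are calling through ManagedCuda, the same discipline applies (observe every returned status/exception and synchronize after launches so a failure is reported where it originates rather than at some later call).

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure is reported immediately,
// instead of surfacing later as a confusing launch failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be larger than n
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    // A failed allocation here would otherwise leave d_data invalid.
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel runs

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```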
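And as an illustration of what point 2 is after: the kernel below contains the classic missing bounds check. It is a deliberately buggy sketch (not taken from your code) that may appear to run fine, corrupt a neighboring allocation, or fail randomly with a launch failure, depending on what happens to sit past the end of the buffer. Running such a program under compute-sanitizer (for example compute-sanitizer --tool memcheck yourapp.exe, where yourapp.exe stands in for your executable; the racecheck and initcheck tools cover races and uninitialized reads) reports each out-of-bounds access with the offending kernel, thread, and address, which is usually far easier to act on than an ErrorLaunchFailed surfacing at some later API call.

```
#include <cuda_runtime.h>

// Deliberately buggy: no bounds guard. When the grid is rounded up to a
// multiple of the block size, the last block writes past the end of the
// allocation. The program may still "work" most of the time, which is
// exactly the kind of latent bug compute-sanitizer flags reliably.
__global__ void add_one_buggy(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;           // should be: if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1000;        // not a multiple of the block size
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    add_one_buggy<<<(n + 255) / 256, 256>>>(d, n);  // 1024 threads touch 1000 elements
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```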