Random Failures with RTX Cards in TCC Mode for Calculation

We are getting multiple errors when attempting to run calculations across two GPUs in our software. The GPUs are of various models (RTX 5000s, RTX A2000s, P4000s), and the errors occur randomly across multiple PCs.

ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory.
The context cannot be used, so it must be destroyed (and a new one should be created).

All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA. ---> ManagedCuda.CudaException: ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory.
The context cannot be used, so it must be destroyed (and a new one should be created).

There is a bug in your code. The proximate cause of the failure is an out-of-bounds access in device code, but the root cause could be in either host or device code. I would suggest:

  1. Double-check that the return status of every CUDA API call and every kernel launch is checked. For example, a failed allocation could leave the code operating on an invalid pointer, or a failed copy could leave device code operating on uninitialized data. A sketch of this kind of error checking follows the list.

  2. Run the code under control of compute-sanitizer and address any issues it identifies. Device code may operate on an invalid pointer, process uninitialized data, make an out-of-bounds access prior to the failing one (e.g. one that exceeds the bounds of the intended data object but lands in a neighboring, properly allocated object), or contain a race condition that gives rise to one of the other conditions. A second sketch below shows the kind of latent out-of-bounds bug compute-sanitizer catches even when the program appears to run.
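Here is a minimal sketch of the error checking meant in point 1, written against the CUDA runtime API. The CUDA_CHECK macro and the scale kernel are illustrative placeholders, not something from your application; if you are calling through ManagedCuda, the same discipline applies (observe every returned status/exception and synchronize after launches so a failure is reported where it originates rather than at some later call).

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure is reported immediately,
// instead of surfacing later as a confusing launch failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be larger than n
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    // A failed allocation here would otherwise leave d_data invalid.
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel runs

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```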
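And as an illustration of what point 2 is after: the kernel below contains the classic missing bounds check. It is a deliberately buggy sketch (not taken from your code) that may appear to run fine, corrupt a neighboring allocation, or fail randomly with a launch failure, depending on what happens to sit past the end of the buffer. Running such a program under compute-sanitizer (for example compute-sanitizer --tool memcheck yourapp.exe, where yourapp.exe stands in for your executable; the racecheck and initcheck tools cover races and uninitialized reads) reports each out-of-bounds access with the offending kernel, thread, and address, which is usually far easier to act on than an ErrorLaunchFailed surfacing at some later API call.

```
#include <cuda_runtime.h>

// Deliberately buggy: no bounds guard. When the grid is rounded up to a
// multiple of the block size, the last block writes past the end of the
// allocation. The program may still "work" most of the time, which is
// exactly the kind of latent bug compute-sanitizer flags reliably.
__global__ void add_one_buggy(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;           // should be: if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1000;        // not a multiple of the block size
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    add_one_buggy<<<(n + 255) / 256, 256>>>(d, n);  // 1024 threads touch 1000 elements
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```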