Any way to signal invalid computation pattern inside a kernel Any custom error return mechanism


Is there a way to set CUDA error code when any thread detects a computation pattern that’s invalid. A simple use case could be indexing into any array out-of-bounds:

CUDA Memory buffer : float *in, float *out, // data in and out buffers

Index vector       : int   *idx,            // holds indexing values 1, 2, 3, -1 (invalid)

N                  : Array length

__global__ void getValues(float *out, float *in, int * idx, int N)


 int i = blockIdx.x * blockDim.x + threadIdx.x;

 if (i > N) return;

int index = idx[i];

 if (index < 0 || index >= N) // Can I set some field of CUDA error to signal this...

out[i] = in[index]; 


In the above case, can I set some custom bit field of CUDA error code, which I can catch with cudaGetLastError() call or something else.

I use trap, which results in “unspecified launch error” in the CUDA API. You can also set some application specific variable, or do device side printf…

#define DEVICE_ASSERT(EXP)                      \

do {                                            \

    if (!(EXP)) {                           	\

        // set some global variable         	\                           	

        asm("trap;");                       	\

    }                                       	\

} while (0)


As a related, but broader request: It would be nice to be able to return an arbitrary integer back from a kernel to the host without have to do a dedicated cudaMemcpy. A device function __exit(int status) that terminated the kernel and returned status would cover all the basic use cases, I think.

(You might want an extra flag that determines whether the call should terminate the entire kernel immediately, or whether all threads in the kernel should run to completion before status is returned.)

You could do that with a word of mapped host memory that gets written before the trap instruction (and zeroed from host code).

This is exactly what I do. But I also have a threadfence_system() call before the trap instruction.

The interesting way I use this is not to set some integer code or anything, instead I actually set the error as a pointer to a constant character string. This is weird and crazy because a constant string in the C file isn’t even ON the device, it’s on the host. But the device can set the POINTER to it. So when you get a non-zero response on the host, you get a host (!) pointer to the character string and you can print it! Funny but it works well, and convienient to code since you just do something like

#define STRINGIFY(x) #x


if (index>max_index) {

    *hostmappedErrorPointer="Fatal: Index out of range at line " TOSTRING(__LINE__);




This is so convenient since you need no predefined enum of error codes, or any mapping to print them. Super easy.

The stringify/tostring macro hacks are just to change the integer line number into a compile-time string which is compile-time concatenated with your inline error text.

Nothing stops you from setting more than one error variable, either, though there is a (in practice, common!) chance that multiple threads might all have fatal errors so their writes may be a race, so all the error variables MIGHT not all be from the same thread.

Ah! That is a great idea. I keep forgetting how easy it is now to map host memory into the device address space.

Also a good tip. I don’t know if the trap instruction can abort a queued up write to host memory, so a fence instruction before would be safest.