For debugging purposes I would like to generate an error from kernel code.
In the manual it states that:
“The runtime maintains an error variable for each host thread that is initialized to cudaSuccess and is overwritten by the error code every time an error occurs.”
And indeed, a kernel can terminate with an error: processing stops and the error is reported to the host. So the code is there.
Is there an undocumented way (or a pointer to the documentation), other than e.g. provoking an error through an invalid memory access, to exit a thread with an error?
You need to differentiate between host threads and GPU threads. The error mechanism quoted from the documentation is a per-host-thread state maintained inside the driver on the host. It doesn't have anything to do with threads on the GPU. There is no mechanism inside the GPU to abort a kernel. About the best you can do is keep a flag in global memory which each block reads atomically once at the beginning of the kernel, and whose value, if set, causes all threads within the block to return. Use an atomic memory operation inside the kernel to let a thread flag an error condition, which will then make all subsequent blocks exit.
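The flag scheme above could be sketched roughly like this (untested; the names `errorFlag`, `kernel`, and the negative-value error condition are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int errorFlag = 0;   // 0 means "no error so far"

__global__ void kernel(float *data, int n)
{
    // Each block reads the flag atomically once on entry; if an earlier
    // block already raised an error, every thread in this block returns.
    __shared__ int blockAbort;
    if (threadIdx.x == 0)
        blockAbort = atomicAdd(&errorFlag, 0);   // atomic read of the flag
    __syncthreads();
    if (blockAbort)
        return;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] < 0.0f)      // some error condition
        atomicExch(&errorFlag, 1);    // flag the error for later blocks
}

// On the host, after the kernel has run, the flag can be copied back:
void checkDeviceError()
{
    int hostFlag = 0;
    cudaMemcpyFromSymbol(&hostFlag, errorFlag, sizeof(int));
    if (hostFlag)
        printf("kernel flagged an error\n");
}
```

Note that blocks already running when the flag is raised will finish normally; only blocks scheduled afterwards see it, so this is an early-exit convention, not a true abort.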
However, when I force an invalid memory access (something like *invalidAddress = 0) in the kernel code, all threads and blocks are instantly aborted and an error is reported to the host (I tried this). This is what I referred to when I said “the code is there”. There must be some error flag already, except it's not accessible. So instead of forcing some error, I think it would be useful to explicitly produce an error that does the same (by triggering something that is equivalent to what happens when you access invalid memory).
There are certainly hardware-level memory protection mechanisms and some type of limited programmable interrupts and counters on the GPU (I presume that is how profiling and events are implemented, for example), and the results of those are monitored by the driver. But as I said, none of that is exposed in the CUDA language. Further to that, the sort of “hackish events” you can trigger inside a kernel, like a deliberate access of an invalid address, often result in a loss or corruption of the host GPU context/state, which makes their scope rather limited and difficult to rely on in real-world code.
There is a trap instruction defined in the PTX documentation, but I don’t know how to generate it with the compiler - perhaps inline assembly might work. Might be something to consider.
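A minimal sketch of the inline-assembly route (untested; the kernel name and the device-side condition are made up for illustration):

```cuda
// If the toolchain accepts it, any thread that reaches asm("trap;")
// aborts the running kernel, and the host typically sees a launch
// failure error on its next runtime API call.
__global__ void failOnDemand(const int *shouldFail)
{
    if (*shouldFail)
        asm("trap;");   // emit the PTX trap instruction from device code
}
```

This should behave much like the invalid-memory-access trick described above, with the same caveat that the GPU context may be left unusable afterwards.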