How can I halt an entire kernel?

I have a nice application that requires that I be able to halt a kernel. It is essentially a search operation: it looks for an element equal to 1 in a matrix, the better to perform row and column reduction. The details aren’t important; what matters is whether there is a way to have one thread stop the entire kernel execution.

Not really, no.

Maybe you could poll a shared boolean after a fixed number of operations/iterations in order to determine whether the threads should continue or exit using a return statement.
See section B.5 of the programming guide.

N.

If it makes you feel better, the reason I could answer this so quickly is because I have spent a long time thinking about this problem… :)

Maybe you could add it to the CUDA wish list, Tim :D

N.

Yeah, you can do it by using a periodic poll of some device memory flag… done by every block. It’s messy but does work. But the hassle is really annoying… I’ve found it easier to just make small fast kernels and do the check between kernels.

I don’t think it can work. What if the “terminator” thread takes longer than some other threads, and consequently writes the boolean after they have already checked it?

You would have some threads that continue and some that don’t.

And I think that all the threads accessing the same shared bool would cause some heavy bank conflicts…

It’s much worse than that. You need interblock communication since you want to shut down an entire kernel early, so you need to do device memory polling, not (just) shared memory polling.

So you end up putting a (high latency!) test into your block’s work loop where one thread checks the global memory for the “stop it!” flag. If it’s found, that thread writes to SHARED memory so all the threads in the block can shut down. I even experimented with using one warp only for polling and the remaining warps for work, but that’s a mess too since you need to remerge for __syncthreads() barriers.
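For anyone wanting to try it, here is a minimal sketch of that scheme. The names `searchWithStop` and `findOne`, the launch configuration (32 blocks of 128 threads), and the "search for a 1" payload are all made up for illustration, and error checking is left out. Only thread 0 of each block reads the global flag; the rest of the block learns about it through shared memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block scans chunks of blockDim.x elements for the value 1.
// stopFlag lives in global memory so that every block can see it.
__global__ void searchWithStop(const int *data, int n,
                               volatile int *stopFlag, int *foundIdx)
{
    __shared__ int blockStop;

    // The loop bounds depend only on blockIdx, so all threads of a block
    // take the same number of iterations and the barriers stay uniform.
    for (int base = blockIdx.x * blockDim.x; base < n; base += gridDim.x * blockDim.x) {
        // One thread pays the global-memory latency for the whole block.
        if (threadIdx.x == 0) blockStop = *stopFlag;
        __syncthreads();
        if (blockStop) return;   // same value for every thread, so the whole block exits together

        int i = base + threadIdx.x;
        if (i < n && data[i] == 1) {
            atomicMin(foundIdx, i);  // record the hit
            __threadfence();         // make the result visible before raising the flag
            *stopFlag = 1;
        }
        // Everyone must have read blockStop before thread 0 overwrites it.
        __syncthreads();
    }
}

// Hypothetical host helper: returns an index holding a 1, or -1 if there is none.
int findOne(const int *hostData, int n)
{
    int *dData, *dFlag, *dFound;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dFlag, sizeof(int));
    cudaMalloc(&dFound, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dFlag, 0, sizeof(int));
    int result = n;                  // n means "nothing found"
    cudaMemcpy(dFound, &result, sizeof(int), cudaMemcpyHostToDevice);

    searchWithStop<<<32, 128>>>(dData, n, dFlag, dFound);

    cudaMemcpy(&result, dFound, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData); cudaFree(dFlag); cudaFree(dFound);
    return result == n ? -1 : result;
}
```

Note that blocks which have not yet polled keep working for a while, so if the matrix holds several 1s the index you get back is not necessarily the smallest one.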

Efficiency-wise, it’s not really TOO bad, but it’s really just easier to call a thousand short 3ms kernels rather than a single 3s kernel with completion polling.
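The short-kernel approach could look roughly like this. Again a sketch only: `searchSlice`, `findOneChunked`, and the `sliceSize` parameter are hypothetical names, and the slice size would have to be tuned so that each launch takes a few milliseconds on your hardware. The host reads back a single int after every launch and simply stops launching once something has been found.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scans only the slice [offset, offset + count).
__global__ void searchSlice(const int *data, int offset, int count, int *foundIdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count && data[offset + i] == 1)
        atomicMin(foundIdx, offset + i);
}

// Hypothetical host helper: launches one small kernel per slice and checks
// the result between launches. Returns the smallest index holding a 1, or -1.
int findOneChunked(const int *hostData, int n, int sliceSize)
{
    int *dData, *dFound;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dFound, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    int result = n;                  // n means "nothing found"
    cudaMemcpy(dFound, &result, sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    for (int offset = 0; offset < n && result == n; offset += sliceSize) {
        int count = (n - offset < sliceSize) ? (n - offset) : sliceSize;
        int blocks = (count + threads - 1) / threads;
        searchSlice<<<blocks, threads>>>(dData, offset, count, dFound);
        // This blocking copy waits for the kernel, then tells the host
        // whether it should bother launching the next slice.
        cudaMemcpy(&result, dFound, sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaFree(dData); cudaFree(dFound);
    return result == n ? -1 : result;
}
```

The "halt" here is just the host deciding not to launch the next slice, so no flag has to be polled inside the kernel at all.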

One probable way is to write to a host-memory location (using zero-copy method) – which the host-application/driver would poll for and then request the GPU to stop executing the kernel.
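The kernel-to-host half of that idea can be sketched with mapped pinned memory. This only shows the back-channel: the host notices the signal while the kernel is still running, but nothing here stops the kernel. The names `searchAndSignal` and `searchWithHostSignal` are hypothetical, error checking is omitted, and the sketch assumes a device that supports mapped host memory (older setups may need `cudaSetDeviceFlags(cudaDeviceMapHost)` before the context is created).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes (index + 1) into a flag that lives in host memory.
__global__ void searchAndSignal(const int *data, int n, volatile int *hostFlag)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (data[i] == 1) {
            *hostFlag = i + 1;        // 0 is reserved for "nothing found yet"
            __threadfence_system();   // order the write with respect to the host
            return;
        }
    }
}

// Hypothetical host helper: returns an index holding a 1, or -1 if there is none.
int searchWithHostSignal(const int *hostData, int n)
{
    int *dData, *hRaw, *dFlag;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);

    // Pinned host allocation that the device can address directly (zero-copy).
    cudaHostAlloc((void **)&hRaw, sizeof(int), cudaHostAllocMapped);
    *hRaw = 0;
    cudaHostGetDevicePointer((void **)&dFlag, hRaw, 0);
    volatile int *hFlag = hRaw;       // volatile so the host loop re-reads memory

    searchAndSignal<<<32, 128>>>(dData, n, dFlag);

    // Poll while the kernel is in flight. cudaStreamQuery also nudges the
    // driver to actually submit the launch on platforms that batch work.
    while (*hFlag == 0 && cudaStreamQuery(0) == cudaErrorNotReady) { }
    // This is where the host would ask the driver to stop the kernel,
    // if such a call existed.

    cudaDeviceSynchronize();          // after this, the flag value is final
    int result = *hFlag - 1;
    cudaFreeHost(hRaw);
    cudaFree(dData);
    return result;
}
```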

That’s a decent way to do it. But right now, the driver does NOT have an API to halt a kernel (at least none that I am aware of).

You could call “cudaThreadExit()” from the host side… but that will destroy the context, so you won’t get any results from your kernel.

I have used “cudaThreadExit()” many times, but exactly what it does is not clear to me.

Could you explain a bit more?

cudaThreadExit destroys the context. I think all CUDA pointers in that context would become invalid after a call to cudaThreadExit().

Extending this weird(?) idea, one could make this zero-copy physical address the interrupt-trigger address of the GPU.

But I’m not sure whether PCI-E allows the master and target to be the same device…

I just came up with an idea: how about triggering a GPU exception to terminate the kernel (such as an invalid read or write to global/shared memory)?
However, I am not sure whether the partial results already written to global or host memory would still be valid after the exception is raised.

Segfaulted kernels don’t return partial results. I have encountered this behaviour before.

That’s why we need to gently request the GPU to stop, and that request has to come from the driver…

What is needed is a back-channel for the kernel to communicate to the driver.

If the GPU had a service processor, it could probably read a special memory location and signal an interrupt to the driver. But not so sure about the architecture of GPUs.

It’s also possible that the only way to halt a kernel in progress is to send a reset signal to the hardware.

Anyway, right now, the best way to stop a kernel would likely be to just have a flag in memory. The signaling thread sets the flag, and all threads periodically poll it and return if they see it set. By setting the polling rate moderately low, say, once per 10 iterations, the polling won’t affect performance very much, at the cost of the response time being a bit longer.
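A bare-bones sketch of that per-thread polling, assuming the work loop contains no `__syncthreads()` (otherwise threads returning at different times would break the barrier). `pollingSearch`, `findOnePolling`, and `POLL_INTERVAL` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define POLL_INTERVAL 10   // check the flag once per 10 iterations, as suggested

// Hypothetical kernel: every thread polls the global flag on its own.
__global__ void pollingSearch(const int *data, int n,
                              volatile int *stopFlag, int *foundIdx)
{
    int stride = gridDim.x * blockDim.x;
    int iter = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride, ++iter) {
        // Cheap most of the time: the global read happens only every
        // POLL_INTERVAL iterations.
        if (iter % POLL_INTERVAL == 0 && *stopFlag) return;

        if (data[i] == 1) {
            atomicMin(foundIdx, i);  // record the hit
            __threadfence();         // publish the result before the flag
            *stopFlag = 1;           // ask everyone else to quit
            return;
        }
    }
}

// Hypothetical host helper: returns an index holding a 1, or -1 if there is none.
int findOnePolling(const int *hostData, int n)
{
    int *dData, *dFlag, *dFound;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dFlag, sizeof(int));
    cudaMalloc(&dFound, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dFlag, 0, sizeof(int));
    int result = n;                  // n means "nothing found"
    cudaMemcpy(dFound, &result, sizeof(int), cudaMemcpyHostToDevice);

    pollingSearch<<<16, 64>>>(dData, n, dFlag, dFound);

    cudaMemcpy(&result, dFound, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData); cudaFree(dFlag); cudaFree(dFound);
    return result == n ? -1 : result;
}
```

Raising `POLL_INTERVAL` trades response time for fewer global reads; threads that have just polled can run up to that many extra iterations after the flag is set.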

What an ugly hack!

You should be proud of yourself, that’s really clever. I bet it’d work.

If you call __threadfence() first, it will likely allow all pending writes to complete safely before your kill. Of course, this is clearly not guaranteed in any way.

Except that it invalidates your context.

Meaning the partial results won’t be available. Segfaulting is not an option.