How does the CPU know when a GPU kernel finishes?

I’m wondering how the CPU knows when a GPU kernel finishes. Are interrupts used? If so, I’m afraid handling the interrupt could take longer than the kernel itself. Our investigation showed that after launching a kernel, the CPU calls yield several times. Does this indicate that polling is used? And since we only see a handful of yields, does it mean that after several polls it falls back to interrupts?

It doesn’t, unless you call cudaDeviceSynchronize() or use some other form of synchronization, I think. Kernel launches are asynchronous, so control returns to the host thread right away.
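To make the point above concrete, here is a minimal sketch of the two ways a host thread can learn that a kernel is done: polling an event with cudaEventQuery(), or blocking in cudaDeviceSynchronize(). The kernel addOne and the helper launchAndWait are hypothetical names made up for this illustration. The sketch only shows what the application can do; it says nothing about what the driver does internally while cudaDeviceSynchronize() waits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: adds 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Launches the kernel and waits for it. Returns 0 on success, 1 on failure.
int launchAndWait() {
    const int n = 1 << 20;
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(int));

    cudaEvent_t done;
    cudaEventCreate(&done);

    // The launch only queues the work; the host continues immediately.
    addOne<<<(n + 255) / 256, 256>>>(d, n);
    // Record a marker in the default stream, right behind the kernel.
    cudaEventRecord(done, 0);

    // Option 1: explicit polling from application code.
    unsigned long polls = 0;
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        ++polls;  // the host is free to do other work here
    }

    // Option 2: block until all prior device work is complete.
    // (Returns almost at once here because the event already fired.)
    cudaError_t err = cudaDeviceSynchronize();

    printf("polled %lu times, sync status: %s\n", polls, cudaGetErrorString(err));

    cudaEventDestroy(done);
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    return launchAndWait();
}
```

The number of polls printed will vary from run to run and machine to machine, so no expected output is given.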

You might be interested in looking into the callbacks:

3.2.5.5.6. Callbacks
The runtime provides a way to insert a callback at any point into a stream via
cudaStreamAddCallback(). A callback is a function that is executed on the host once
all commands issued to the stream before the callback have completed. Callbacks in
stream 0 are executed once all preceding tasks and commands issued in all streams
before the callback have completed.

AFAIK the runtime API creates a new thread to execute the callback, so it does not interrupt the main application thread.
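For reference, a minimal sketch of the callback approach described in the quoted section. The kernel setFlag, the callback onDone and the helper runWithCallback are hypothetical names for this example; the only runtime calls used are cudaStreamAddCallback(), cudaStreamCreate(), cudaStreamSynchronize() and the usual allocation functions. Which host thread actually runs the callback is up to the runtime and is not something this sketch demonstrates.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
__global__ void setFlag(int *flag) { *flag = 1; }

// Host function invoked once all earlier work in the stream has completed.
// CUDA API calls must not be made from inside a stream callback.
void CUDART_CB onDone(cudaStream_t stream, cudaError_t status, void *userData) {
    auto *called = static_cast<std::atomic<int> *>(userData);
    called->store(status == cudaSuccess ? 1 : -1);
}

// Returns 1 if the callback ran and reported success.
int runWithCallback() {
    int *dFlag = nullptr;
    if (cudaMalloc(&dFlag, sizeof(int)) != cudaSuccess) return 0;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    std::atomic<int> called(0);

    setFlag<<<1, 1, 0, stream>>>(dFlag);
    // The last argument (flags) is reserved and must be 0.
    cudaStreamAddCallback(stream, onDone, &called, 0);

    // Wait for everything queued in the stream, including the callback.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dFlag);
    return called.load();
}

int main() {
    int result = runWithCallback();
    printf("callback result: %d\n", result);
    return result == 1 ? 0 : 1;
}
```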

My understanding is that cudaStreamAddCallback() provides a way to call CPU functions once the GPU kernel finishes. But it doesn’t tell us which mechanism the GPU driver uses to detect that the kernel has finished. My guess is polling, but I’m not exactly sure. I hope someone from NVIDIA can let us know the answer.

How the CPU thread behaves while it waits at a synchronization point may vary. You may want to read:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g18074e885b4d89f5a0fe1beab589e0c8
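As a rough illustration of how that waiting behavior can be selected, the sketch below uses cudaSetDeviceFlags() with one of the documented scheduling flags (cudaDeviceScheduleAuto, cudaDeviceScheduleSpin, cudaDeviceScheduleYield, cudaDeviceScheduleBlockingSync). This assumes the page linked above is about those device flags. The kernel noop and the helper waitWithPolicy are hypothetical names. The flags describe what the host thread does while waiting (spin, yield, or block on a synchronization primitive); they do not by themselves confirm whether the driver uses interrupts underneath.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, for illustration only.
__global__ void noop() {}

// Selects a host-side waiting policy, launches a kernel and waits for it.
// Returns 0 on success, 1 on failure.
int waitWithPolicy(unsigned int flags) {
    // Set the policy before any other CUDA call creates the device context;
    // some CUDA versions reject the call once the device is already active.
    cudaError_t err = cudaSetDeviceFlags(flags);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    noop<<<1, 1>>>();

    // How the host thread waits inside this call depends on the flag:
    //   cudaDeviceScheduleSpin         -> spin actively
    //   cudaDeviceScheduleYield        -> yield the thread while waiting
    //   cudaDeviceScheduleBlockingSync -> block on a synchronization primitive
    //   cudaDeviceScheduleAuto         -> let the runtime choose (default)
    err = cudaDeviceSynchronize();
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    int rc = waitWithPolicy(cudaDeviceScheduleBlockingSync);
    printf("wait finished with status %d\n", rc);
    return rc;
}
```

Swapping in cudaDeviceScheduleYield and watching the process with a tracing tool would be one way to check whether the yield calls mentioned in the original question come from this policy.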