GPU polling for completion

I want to use the CPU in parallel with the GPU. So far I have been giving the GPU a fixed amount of work to do.
Is it possible to poll the GPU for completion, using either the runtime or the driver API? If so, how?

Yes, that is indeed possible. You do it by recording an event (cudaEvent_t) in a stream immediately after your last asynchronous memory copy, and then polling that event with cudaEventQuery, which returns cudaErrorNotReady until all preceding work in the stream has finished.
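
A minimal sketch of that pattern (the kernel myKernel, the buffer size, and doSomeCpuWork are placeholders for illustration, not from the original post):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel so the sketch is self-contained.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

static void doSomeCpuWork(void) { /* CPU-side work overlapped with the GPU */ }

int main(void)
{
    const int n = 1 << 20;
    float *h_data, *d_data;
    cudaMallocHost(&h_data, n * sizeof(float));  // pinned memory so the async copies can overlap
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);  // no timing needed, cheaper event

    // Queue the GPU work: copy in, kernel, copy back, then record the event.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);  // signalled once all prior work in the stream completes

    // Poll: cudaEventQuery returns cudaErrorNotReady until the event has been reached.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        doSomeCpuWork();  // keep the CPU busy while the GPU runs
    }

    printf("GPU work finished, results are in h_data\n");

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

The driver API offers the equivalent calls cuEventRecord and cuEventQuery, which follow the same record-then-poll pattern.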