Do the non-async calls sleep or burn CPU?

Using CUDA in a truly asynchronous event loop would be much easier if it gave you a file descriptor that you could pass to the select() or poll() system call. Then you could wait for a GPU completion event the same way you would wait for data from another thread or process. As it stands, GPU completion has to be handled as a special case, which is quite awkward.
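To make the awkwardness concrete, here is a rough sketch of the kind of bridge I mean, assuming Linux (eventfd is Linux-specific). A callback with the same shape as CUDA's host-function type writes to an eventfd, and the event loop waits on that descriptor with poll(). In real code the callback would presumably be enqueued with cudaLaunchHostFunc(stream, signal_completion, &efd); here a plain thread stands in for the GPU stream so the example runs without a GPU. The names signal_completion and run_demo are made up for illustration.

```cpp
#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>

#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Same shape as CUDA's cudaHostFn_t: void (*)(void* userData).
// Assumption: in real CUDA code this would be enqueued on a stream with
// cudaLaunchHostFunc(stream, signal_completion, &efd).
void signal_completion(void* user_data) {
    assert(user_data != nullptr);
    int efd = *static_cast<int*>(user_data);
    std::uint64_t one = 1;
    // Adding to the eventfd counter makes the descriptor readable.
    ssize_t n = write(efd, &one, sizeof one);
    (void)n;  // a real program would check for errors here
}

// Hypothetical demo: returns the eventfd counter read after poll() reports
// readiness, or 0 on error or timeout.
std::uint64_t run_demo() {
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (efd < 0) return 0;

    // Stand-in for the GPU stream: "finishes" after 50 ms, then signals.
    std::thread fake_gpu([&efd] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        signal_completion(&efd);
    });

    // The event loop side: wait on the descriptor like any socket or pipe.
    pollfd pfd{};
    pfd.fd = efd;
    pfd.events = POLLIN;

    std::uint64_t count = 0;
    int ready = poll(&pfd, 1, 2000);  // 2 second timeout
    if (ready == 1 && (pfd.revents & POLLIN)) {
        // Reading returns the accumulated counter and resets it to zero.
        if (read(efd, &count, sizeof count) != static_cast<ssize_t>(sizeof count)) {
            count = 0;
        }
    }

    fake_gpu.join();
    close(efd);
    return count;
}
```

Calling run_demo() from main() blocks in poll() (sleeping, not spinning) until the stand-in thread signals. The point is that you have to build this plumbing yourself, one callback per completion you care about, instead of getting a descriptor from the runtime.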