Host CPU pegged while CUDA kernels run?

I’m working on a CUDA program that now performs quite well on a single GPU. Before doing more with threads, I ran some tests watching the host CPU load while the program runs. The host CPU is pegged at 100% utilization for as long as the CUDA kernels are running, presumably because the CUDA runtime library spins while waiting for each kernel to complete. This holds whether the kernels run for 1/20th of a second or as long as 5 seconds. The kernel invocations don’t do any memory transfers, so I don’t see any particular reason for the host CPU to be loaded while it waits.

I asked David Kirk about this earlier this week, and he indicated that this kind of behavior should be considered a bug, so I’m reporting it here now that I’ve tested for it and observed it occurring with my test program.

My concern about host CPU load is simply that I would like to spawn threads to manage the CUDA execution while the main CPU continues doing useful work in a different thread, rather than spinning while waiting for the CUDA kernel to complete. Any comments or suggestions are welcome. This was tested on RHEL4u4 with the beta release of CUDA, etc.

Thanks,
John
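
For illustration, here is a minimal sketch of one way a host thread could wait on a kernel without spinning: record an event after the launch, then poll it with `cudaEventQuery` and sleep between polls. It assumes a CUDA runtime newer than the beta discussed above (one that provides the event API and a C++11-capable `nvcc`). The kernel `busyKernel`, the helper `launchAndPoll`, the grid size, and the 1 ms sleep are hypothetical choices, not taken from the original program.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: burns GPU time and does no host/device copies.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)i;
    for (int k = 0; k < iters; ++k)
        v = v * 1.0000001f + 0.5f;
    out[i] = v;  // store the result so the loop is not optimized away
}

// Launches the kernel, then polls for completion, sleeping between polls so
// the host thread gives up the CPU. Returns the poll count, or -1 on error.
long launchAndPoll(int iters)
{
    const int blocks = 64, threads = 256;
    float *d_out = NULL;
    cudaEvent_t done;

    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return -1;
    if (cudaEventCreate(&done) != cudaSuccess) {
        cudaFree(d_out);
        return -1;
    }

    busyKernel<<<blocks, threads>>>(d_out, iters);  // returns to the host right away
    cudaError_t st = cudaGetLastError();            // catches launch failures
    if (st == cudaSuccess)
        st = cudaEventRecord(done, 0);              // marks the end of the kernel

    long polls = 0;
    if (st == cudaSuccess) {
        // cudaEventQuery returns cudaErrorNotReady until the GPU reaches the event.
        while ((st = cudaEventQuery(done)) == cudaErrorNotReady) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            ++polls;
        }
    }

    cudaEventDestroy(done);
    cudaFree(d_out);
    return st == cudaSuccess ? polls : -1;
}

int main()
{
    long polls = launchAndPoll(1 << 22);
    if (polls < 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("kernel finished after %ld sleeping polls\n", polls);
    return 0;
}
```

The sleep trades a little completion latency (up to roughly the sleep interval) for an idle host core. Later runtimes also offer `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)`, which asks the runtime to block on an OS primitive instead of spinning; whether either option applies to the beta release is not something this sketch can confirm.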

Please provide a test app which reproduces this behavior (including build & run instructions).

Thanks,
Lonni
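
A test app along the lines requested could be as small as the sketch below. It is not the original poster's program: `busyKernel`, `measureWait`, the grid size, and the default iteration count are hypothetical. It launches a kernel that does no memory transfers, waits for it, and compares host CPU time with wall-clock time over the wait. It assumes Linux (where `std::clock` reports process CPU time) and a runtime that provides `cudaDeviceSynchronize`; older toolkits named the equivalent call `cudaThreadSynchronize`.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: burns GPU time and does no host/device copies.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)i;
    for (int k = 0; k < iters; ++k)
        v = v * 1.0000001f + 0.5f;
    out[i] = v;  // store the result so the loop is not optimized away
}

// Measures host CPU seconds and wall seconds spent launching the kernel and
// waiting for it to finish. Returns 0 on success, 1 on any CUDA error.
int measureWait(int iters, double *cpuSec, double *wallSec)
{
    const int blocks = 64, threads = 256;
    float *d_out = NULL;

    // The first runtime call also creates the context, so that cost is
    // kept outside the timed region.
    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return 1;

    std::clock_t c0 = std::clock();
    std::chrono::steady_clock::time_point w0 = std::chrono::steady_clock::now();

    busyKernel<<<blocks, threads>>>(d_out, iters);
    cudaError_t st = cudaGetLastError();   // catches launch failures
    if (st == cudaSuccess)
        st = cudaDeviceSynchronize();      // the wait under test

    std::clock_t c1 = std::clock();
    std::chrono::steady_clock::time_point w1 = std::chrono::steady_clock::now();

    *cpuSec = double(c1 - c0) / CLOCKS_PER_SEC;
    *wallSec = std::chrono::duration<double>(w1 - w0).count();

    cudaFree(d_out);
    return st == cudaSuccess ? 0 : 1;
}

int main(int argc, char **argv)
{
    // Optional first argument: loop iterations per GPU thread.
    int iters = (argc > 1) ? std::atoi(argv[1]) : (1 << 22);
    double cpu = 0.0, wall = 0.0;

    if (measureWait(iters, &cpu, &wall) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("wall %.3f s, host CPU %.3f s\n", wall, cpu);
    return 0;
}
```

Saved as a file such as `spin_test.cu` (a hypothetical name), this should build with `nvcc -O2 -o spin_test spin_test.cu` and run as `./spin_test 4194304`, with the argument raised until the kernel runs long enough to observe in `top`. If the host CPU time comes out close to the wall time, the waiting thread was busy for the whole kernel; a much smaller CPU figure means it was mostly idle.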