CPU Utilization during kernel execution

I am running the vectorAdd example from the SDK and monitoring GPU/CPU utilization with the timeline analysis in Parallel Nsight. I notice that during kernel execution the CPU utilization is always at 100%. How can that be? The timeline shows that the process that launched the kernel is busy executing cuCtxSynchronize.

Any ideas?


I think you will find this is actually an artifact of profiling. To get all of the profile data, the profiler winds up “decorating” API calls with additional events and timers, in the process serializing a lot of things that would otherwise be asynchronous. What you are seeing is probably due to an event or timer requiring the calling CPU thread to sit in a spinlock. In normal execution it shouldn’t happen.

Thanks for your quick answer.

If this is the case, how would you profile the CPU utilization during kernel execution?

Just use the CPU code profiling tool of your choice. You don’t get to see what the GPU is doing internally, but if you are only interested in what the CPU threads are doing, it doesn’t really matter.
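If a full CPU profiler is more than you need, a rough check can be done in the program itself by comparing the CPU time the calling thread consumed against the wall-clock time spent waiting for the kernel. The sketch below assumes a POSIX system (it uses clock_gettime with CLOCK_THREAD_CPUTIME_ID), and busyKernel and threadCpuSeconds are made-up names for illustration, not anything from the SDK sample.

```cuda
#include <cstdio>
#include <chrono>
#include <time.h>            // clock_gettime (POSIX only, an assumption of this sketch)
#include <cuda_runtime.h>

// Hypothetical kernel that just keeps the GPU busy for a while.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.0000001f + 0.5f;
        data[i] = x;
    }
}

// CPU time consumed so far by the calling thread, in seconds.
static double threadCpuSeconds()
{
    timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));
    cudaDeviceSynchronize();               // finish setup before timing

    auto wall0 = std::chrono::steady_clock::now();
    double cpu0 = threadCpuSeconds();

    busyKernel<<<(n + 255) / 256, 256>>>(d, n, 10000);
    cudaError_t err = cudaDeviceSynchronize();   // the wait we want to measure

    double cpu1 = threadCpuSeconds();
    auto wall1 = std::chrono::steady_clock::now();

    double wall = std::chrono::duration<double>(wall1 - wall0).count();
    double cpu = cpu1 - cpu0;
    printf("status: %s\n", cudaGetErrorString(err));
    printf("wall: %.3f s, thread CPU: %.3f s (%.0f%% busy)\n",
           wall, cpu, wall > 0.0 ? 100.0 * cpu / wall : 0.0);

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

A ratio close to 100% suggests the thread is spinning while it waits; a much lower ratio suggests it is sleeping. Run it outside the profiler so the numbers are not affected by instrumentation.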

Thanks a lot for your answer.

Then I don’t see the whole point of providing an analysis tool that includes CPU performance if the numbers it reports are not correct.

Many of them are probably correct; it is only the asynchronous calls that aren’t. If you want to see how long a cudaMemcpy call takes, the profiler will show something pretty close to what you would get if you timed it yourself. If you want to see how much overlap a pair of cudaMemcpyAsync calls is getting with a pair of running kernels, it won’t.
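For reference, "timing it yourself" for a blocking copy could look roughly like the sketch below. The 64 MiB buffer size is arbitrary, and the extra cudaDeviceSynchronize before stopping the timer is a precaution I added in case the copy from pageable memory returns before the device has finished with it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64u << 20;        // 64 MiB, arbitrary size for illustration
    float *h = static_cast<float *>(malloc(bytes));
    float *d = nullptr;
    if (h == nullptr || cudaMalloc(&d, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        free(h);
        return 1;
    }
    memset(h, 0, bytes);
    cudaDeviceSynchronize();               // make sure nothing else is pending

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();               // be sure the copy has really finished
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("cudaMemcpy: %s, %.3f ms\n", cudaGetErrorString(err), ms);

    cudaFree(d);
    free(h);
    return err == cudaSuccess ? 0 : 1;
}
```

This only makes sense for synchronous calls. For overlapping cudaMemcpyAsync calls and kernels, host-side timers around individual calls say little about what actually overlapped on the device.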

I think by default the CPU thread that waits on a GPU kernel to complete will keep polling the GPU, in order to detect the termination of the kernel ASAP. Such polling keeps the CPU busy. This is reminiscent of how old DOS programs used to query the keyboard for the key press event rather than waiting on the keyboard interrupt.

I invoke cudaSetDeviceFlags(cudaDeviceBlockingSync) at the GPU initialization stage and observe about 10% CPU utilization while my kernels are executing on the GPU. I’m not sure whether the cudaDeviceScheduleYield flag would be more relevant. I don’t know how much longer it takes for the CPU thread to detect kernel completion in cudaDeviceBlockingSync mode. In my case, with a large number of complex kernels, this delay seems to be relatively small. My platform is Linux, but I don’t see why Windows would be any different in this respect.
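A minimal sketch of that setup is below. It uses cudaDeviceScheduleBlockingSync, which is the current name for the flag (cudaDeviceBlockingSync is the older, deprecated spelling), and emptyKernel is just a placeholder. The flag is set before any other runtime call so that it is in place when the context gets created.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your real work here.
__global__ void emptyKernel() {}

int main()
{
    // Ask the runtime to block the waiting CPU thread on a synchronization
    // primitive instead of spinning. Done first, before the context exists.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    emptyKernel<<<1, 1>>>();

    // With the blocking flag set, this wait should let the thread sleep
    // rather than poll the GPU.
    err = cudaDeviceSynchronize();
    printf("sync: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Swapping in cudaDeviceScheduleYield or cudaDeviceScheduleSpin on the same line is an easy way to compare CPU utilization and completion latency for a given workload.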