CUDA CPU Utilization

So CPU utilization while the GPU is working can be reduced by sleeping between polling calls that check whether the GPU is still busy.

My question is this:
In the future, will we be able to yield the thread while the GPU processes and then have the GPU signal back to the process to wake up when the computations are finished? That would avoid the polling issue altogether, and the CPU could sleep nicely while the GPU works.

You don’t need to have the CPU poll constantly; you can add a microsleep between tests. Yes, this is still polling, but it works and reduces your CPU use to effectively 0%. What you lose is roughly half a millisecond of latency (or whatever sleep quantum you use).
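For concreteness, here is roughly what that looks like (a sketch only: the kernel, buffer name, and the 0.5 ms quantum are just examples, not anything prescribed):

```
#include <cuda_runtime.h>
#include <unistd.h>          // usleep

// Hypothetical kernel standing in for whatever work you launch.
__global__ void myKernel(float *data) { /* ... */ }

// Wait for the GPU without pegging a CPU core: record an event after the
// launch, then poll it with a microsleep between tests.
void launchAndWait(float *d_data, int n)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    myKernel<<<(n + 255) / 256, 256>>>(d_data);
    cudaEventRecord(done, 0);

    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(500);         // ~0.5 ms sleep quantum; CPU stays essentially idle

    cudaEventDestroy(done);
}
```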

I do understand what you want: an OS-level event that a thread can wait on with NO latency and NO polling, and we don’t have that. I don’t know if that’s an OS, a GPU driver, or a CUDA framework limitation.

My first statement reflects your first statement. It seems like current GPU hardware designs are why we have to poll, sleep, etc. in the first place. What I am asking is whether, in the future, we will be able to call our CUDA function, have our thread yield, and then have the GPU wake the thread back up when it finishes.

Well, you’d still get significant overhead from the OS, so you’d probably be better off with the sleep-and-poll behavior anyway. Unless you just want some syntactic sugar to hide that behavior from you?

(Pretty sure it’s an OS problem more than anything else: the GPU fires an interrupt, the CPU catches the interrupt, switches to kernel mode, checks what that interrupt should do, loads thread state… yeah, that’s not free.)

I have an idea: the blocking functions in CUDA, especially the sync one, should do the sleep-polling themselves. That way you get low CPU usage and simple syntax. I guess NVIDIA did it the way they did to keep latency low. Well, in that case NVIDIA itself should do the fancy wake-on-interrupt that the OP is talking about. Then everyone is happy. It’s a pretty straightforward fix to the way things are now.
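Something along these lines, say (just a sketch; sleepSynchronize and the 0.5 ms quantum are my own name and number, not anything NVIDIA provides):

```
#include <cuda_runtime.h>
#include <unistd.h>

// Hypothetical drop-in for the blocking sync call: record an event on the
// default stream and sleep-poll it instead of spinning.
cudaError_t sleepSynchronize(void)
{
    cudaEvent_t done;
    cudaError_t err = cudaEventCreate(&done);
    if (err != cudaSuccess) return err;

    cudaEventRecord(done, 0);   // completes after all prior work on the default stream
    while ((err = cudaEventQuery(done)) == cudaErrorNotReady)
        usleep(500);

    cudaEventDestroy(done);
    return err;
}
```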

P.S. tmurray, I’m not sure what you’re saying. Yes, you have to process the interrupt, but all of that is happening anyway. Modifying the event handler to wake the CUDA runtime thread shouldn’t add much to it.

I think “fix” is not the right word. If the default behaviour gets modified, the HPC people will scream. Maybe something could be done with streams and a standard polling function that gets fed the event to wait for? Maybe with some macro?

There is already a setting to make the driver yield() instead of spin-looping, but that does not reduce CPU usage. Low CPU usage matters on e.g. laptops (yes, I know, few will use CUDA on those, but IMO it will be essential if CUDA is intended for truly general-purpose use), and it also makes it easier to see how much CPU the program actually needs.
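(Assuming the setting meant here is the cudaDeviceScheduleYield flag of cudaSetDeviceFlags(), it is selected like this, before anything that creates the context:)

```
#include <cuda_runtime.h>

int main(void)
{
    // Must be called before the context exists, i.e. before the first
    // kernel launch or device allocation.
    cudaSetDeviceFlags(cudaDeviceScheduleYield);   // yield instead of spin on sync

    // ... normal CUDA work; sync calls now yield the CPU rather than busy-wait,
    // which helps other threads but does not by itself drop CPU usage to ~0%.
    return 0;
}
```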

They could add a flag/option that changes this behaviour. The way it is now, I have to carefully tune the sleep time to make it work optimally; ideally the choice would also depend on whether you have a tickless kernel, etc., and on Windows it would need a completely different implementation because of the usually inaccurate timers.
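For illustration, a rough cross-platform microsleep wrapper might look like this (microSleep is my own name; on Windows the default timer granularity is around 15 ms unless it is raised with timeBeginPeriod()):

```
#ifdef _WIN32
#include <windows.h>
static void microSleep(unsigned int us)
{
    // Sleep() takes milliseconds and its granularity is coarse by default,
    // so a value tuned in microseconds on Linux does not translate directly.
    Sleep(us >= 1000 ? us / 1000 : 1);
}
#else
#include <unistd.h>
static void microSleep(unsigned int us)
{
    usleep(us);   // on a tickless kernel this can be close to the requested time
}
#endif
```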

I am almost certain NVIDIA could provide a generic solution that would at least work no worse; it would probably work much better, be portable across platforms, and spare every developer who needs this from reinventing the wheel.

We are working on a couple of possible solutions, but we don’t have a timeframe yet.

It is appreciated, and just to be clear: my comment was not meant so much as a complaint, but more as an opinion about the best solution to the problem.