Why is CPU load 100% while a kernel is running?

Can anyone advise why a running kernel utilizes 100% of one CPU core?
I’m launching kernels via the driver API.
Debian/lenny, driver 180.60.

My kernels are rather long-running (hundreds of ms to seconds), so I can see that the CPU is loaded exactly while a kernel is running (and not during memory transfers, etc.).

I chose CU_CTX_SCHED_YIELD when creating the context. This reduced the user load from the initial 100% to ~25%, but about 75% system load appeared, complementary to the user load.
This could be a result of yielding (there are almost no other threads to yield to, so execution returns to this thread, and so on). However, this guess poorly explains the termination behavior: when terminating a host process that is running a prolonged kernel, the user load drops to 0%, but the system load grows to 100%.

Any ideas what this could be?

My understanding is that when a kernel is running, the CPU is constantly polling the graphics chip to see if the kernel is finished, which is why the utilization for one core is 100%. The constant polling helps make sure that the overhead for each kernel call is as low as possible, which is critical for programs where the kernel calls are only 50 microseconds or so. Since your kernel calls are longer, the constant polling doesn’t help much. Unfortunately, I don’t think there’s a way to reduce the CPU utilization during kernel execution, but I do know that the issue has definitely been raised before :P

Say I have this setup: a CPU with two cores and two GPUs. I run two threads, one for each GPU (each constantly running many kernels one after another), and two more threads, one for each CPU core, that also do work continuously. Does this mean that the two threads created to drive the GPUs will bottleneck the whole setup, and the two CPU cores won’t do much work, despite the fact that the GPU threads do nothing except launch kernels?

It’s important to note that the CPU hot spin only happens when a CUDA call blocks. Launching a __global__ function does not block (unless you’ve already queued up 24 or 32 calls, depending on your card), but calling cudaMemcpy() after you’ve started a kernel does block, because the memory copy needs the kernel to finish so that global memory is in a consistent state. Thus, code like this will use both the CPU and the GPU (a sketch follows the list):

  • Launch a __global__ kernel to do the GPU computation

  • Do the CPU computation

  • Call cudaMemcpy() to get the GPU result
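A minimal sketch of that pattern (the kernel and the CPU-side helper here are hypothetical placeholders, not anything from this thread):

  // Hypothetical kernel; the computation is just a placeholder.
  __global__ void gpuWork(float *d_out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_out[i] = 2.0f * i;
  }

  void computeOverlapped(float *h_out, float *d_out, int n)
  {
      // 1. Launch: this queues the kernel and returns immediately.
      gpuWork<<<(n + 255) / 256, 256>>>(d_out, n);

      // 2. CPU work done here overlaps the running kernel.
      doCpuComputation();  // hypothetical CPU-side function

      // 3. The copy waits for the kernel to finish; this is the call
      //    during which the driver spins the CPU.
      cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
  }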

Of course, that’s kind of awkward if you want to keep your CPU and GPU calculations in different threads. There’s a way to set up your own slower polling loop (with appropriately sized sleeps) in the host thread making the CUDA calls, but I can’t remember the commands off the top of my head. Hopefully someone else can help here.

No, GPU execution is asynchronous… your CPU isn’t used at all after launch.
What is misleading everyone is that if you queue up a SYNCHRONOUS operation in the stream, usually a device-to-host memcpy, THEN you’re asking the CPU to poll, and that can peg the CPU while it waits.

To avoid that, use asynchronous memory copies, and stream events to signal when streams are complete.
This is in the programming guide… check the stream section, and events. There are also multiple threads here on the forum.
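A rough sketch of that, in runtime-API terms (the driver-API equivalents are cuMemcpyDtoHAsync and cuEventRecord); myKernel, h_data, and d_data are illustrative names:

  extern __global__ void myKernel(float *data);  // hypothetical kernel

  // Sketch only: h_data should be page-locked (cudaMallocHost) so the
  // copy is truly asynchronous.
  void queueWorkAsync(float *h_data, float *d_data, size_t bytes,
                      cudaStream_t stream, cudaEvent_t done)
  {
      // Everything below is merely queued; none of it blocks the host.
      myKernel<<<128, 256, 0, stream>>>(d_data);
      cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
      cudaEventRecord(done, stream);  // completes once the copy has finished

      // Later, test "done" with cudaEventQuery() rather than calling
      // something that blocks and spins the CPU.
  }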

This is where the new CU_CTX_SCHED_YIELD option comes in. Yielding is what it sounds like: the thread politely asks if anyone else would like a turn on the CPU. If there are such threads, they get to do work. If there aren’t, the yielding thread returns to polling the GPU.

This is in contrast to the original, more selfish mode, which could indeed cause conflicts.

But this is still not the ideal option for some, who would like the blocking thread to sleep and wait for an interrupt, and not do any polling (even polite polling) at all. This would let the CPU utilization drop toward 0%.
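For reference, the scheduling flag is selected at context creation in the driver API; a minimal sketch, with error checking omitted:

  #include <cuda.h>

  CUdevice  dev;
  CUcontext ctx;

  cuInit(0);
  cuDeviceGet(&dev, 0);                        // first GPU
  cuCtxCreate(&ctx, CU_CTX_SCHED_YIELD, dev);  // polling thread yields its timeslice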

That’s what I mentioned in my initial post (ColinS and Romant might have missed that).

I use this option and still see 100% utilization.

Sure, I’m using asynchronous calls.

Do you think the CU_CTX_SCHED_YIELD option affects any synchronization performed by the driver (including internal synchronization in other API calls), or only cuCtxSynchronize()?

Ah, but you’re calling a function that is waiting for a result from the GPU.

As Alex says above, CU_CTX_SCHED_YIELD does not mean you drop to 0% CPU usage.

It means that the polling thread spins but gives up its timeslice if there are other CPU processes.

That still pegs your CPU meter but doesn’t steal as much of the CPU timeslice for itself.

I don’t know for certain at all, but I suspect the normal spin is something as simple as:

while (!GPUDone());  // tight spin

and the “Yield” option is something like:

while (!GPUDone()) Sleep(0); // Win32 option to yield CPU timeslice to any other threads

Note that Sleep(0) does nothing to reduce pegging your CPU usage; it’s just about giving other (equal-priority) threads more slices.

For 0% CPU, you likely want to use events, perhaps multiple streams, and your own (low-rate) polling.

Is there a way to use CU_CTX_SCHED_YIELD without having to go down to the driver API?

Use a stream, signal completion with an event, then write your own spin loop that tests for the finish event, with whatever sleep or yield you want in it.
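A minimal sketch of such a loop, assuming the runtime API and POSIX usleep(); the sleep interval is a tunable latency-vs-CPU-load trade-off:

  #include <cuda_runtime.h>
  #include <unistd.h>  // usleep (POSIX)

  // Record "done" at the end of the stream first:
  //   cudaEventRecord(done, stream);
  // cudaEventQuery() returns cudaErrorNotReady while the preceding
  // work is still in flight, so this waits without hot-spinning.
  void waitLowCpu(cudaEvent_t done)
  {
      while (cudaEventQuery(done) == cudaErrorNotReady)
          usleep(1000);  // poll roughly once per millisecond
  }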

Yes: http://forums.nvidia.com/index.php?s=&…st&p=472418

Huh… I’ve skipped that thread in the past as its beginning seemed useless to me; looks like that was a bad idea :-) Thank you!

In other words, I can loop through my devices (I have two GPUs), create a context for each of them with that flag, and have fun?

Loop through? I’m not sure what you mean.

If you want to use multiple GPUs simultaneously, you need to initialize each of them separately in their own host thread. You will get an error (or incorrect behavior) if you try to switch from one GPU to another within the same host thread.
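A rough pthreads sketch of that rule (one host thread per device, each creating its own context; all names here are illustrative):

  #include <cuda.h>
  #include <pthread.h>

  // Each host thread owns exactly one device: it creates its own
  // context and never touches the other GPU.
  static void *gpuWorker(void *arg)
  {
      int ordinal = *(int *)arg;
      CUdevice  dev;
      CUcontext ctx;

      cuDeviceGet(&dev, ordinal);
      cuCtxCreate(&ctx, CU_CTX_SCHED_YIELD, dev);

      // ... launch kernels on this device only ...

      cuCtxDestroy(ctx);
      return NULL;
  }

  int main(void)
  {
      pthread_t t[2];
      int ids[2] = {0, 1};

      cuInit(0);  // once, before any worker threads start
      for (int i = 0; i < 2; ++i)
          pthread_create(&t[i], NULL, gpuWorker, &ids[i]);
      for (int i = 0; i < 2; ++i)
          pthread_join(t[i], NULL);
      return 0;
  }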

By ‘loop through’ I mean ‘enumerate them one by one’. Excuse my English, please. Understood about the threads, thank you.