Can anyone advise why a running kernel utilizes 100% of one CPU core?
I’m launching kernels via driver API.
Debian/lenny, driver 180.60.
My kernels are rather long-running (from hundreds of ms to seconds), so I can see that the CPU is loaded exactly while a kernel is running (and not during memory transfers, etc.).
I’ve chosen CU_CTX_SCHED_YIELD when creating the context. This reduced user load from the initial 100% to ~25%, but about 75% system load appeared, complementary to the user load.
This could be a result of yielding (there are almost no other threads to yield to, so execution returns to this thread, and so on). However, this guess poorly explains the termination behavior: when I terminate a host thread while a prolonged kernel is running, user load drops to 0%, but system load grows to 100%.
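For reference, here is a minimal driver-API sketch of the context setup described above (error handling reduced to a macro; not the poster's actual code):

```cuda
#include <cuda.h>
#include <stdio.h>

#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { \
        fprintf(stderr, "CUDA error %d at %s:%d\n", r, __FILE__, __LINE__); \
        return 1; } } while (0)

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    /* CU_CTX_SCHED_YIELD: a thread waiting on the GPU yields its time
     * slice between polls instead of busy-spinning (the spin behavior
     * is what produces the 100% user load discussed in this thread). */
    CHECK(cuCtxCreate(&ctx, CU_CTX_SCHED_YIELD, dev));

    /* ... load module, launch kernels, synchronize ... */

    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```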
My understanding is that when a kernel is running, the CPU is constantly polling the graphics chip to see if the kernel is finished, which is why the utilization for one core is 100%. The constant polling helps make sure that the overhead for each kernel call is as low as possible, which is critical for programs where the kernel calls are only 50 microseconds or so. Since your kernel calls are longer, the constant polling doesn’t help much. Unfortunately, I don’t think there’s a way to reduce the CPU utilization during kernel execution, but I do know that the issue has definitely been raised before :P
Say I have this setup: a CPU with two cores and two GPUs. I run two threads, one per GPU (each constantly launching kernels one after another), and two threads, one per CPU core, that also do work continuously. Does this mean that the two GPU threads will bottleneck the whole setup, and the two CPU cores won’t do much work, even though the GPU threads do nothing except launch kernels?
It’s important to note that the CPU hot spin only happens when a CUDA call blocks. Calling a __global__ function does not block (unless you’ve already queued up 24 or 32 calls, depending on your card), but calling cudaMemcpy() after you’ve started a kernel does block, because the memory copy needs the kernel to finish so that global memory is in a consistent state. Thus, code like this will use both CPU and GPU:
call global to do GPU computation
Do CPU computation
call cudaMemcpy() to get GPU result
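The three steps above can be sketched like this with the runtime API (`myKernel` and `doCpuWork` are placeholders, not from the original post):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *out)          /* placeholder kernel */
{
    out[threadIdx.x] *= 2.0f;
}

void doCpuWork(void);                         /* placeholder CPU work */

void overlapExample(float *d_out, float *h_out, size_t n)
{
    /* Step 1: launch returns immediately (asynchronous). */
    myKernel<<<1, 256>>>(d_out);

    /* Step 2: CPU work overlaps with GPU execution. */
    doCpuWork();

    /* Step 3: this is the call that blocks (and may spin-poll)
     * until the kernel has finished. */
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```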
Of course, that’s kind of awkward if you want to keep your CPU and GPU calculations in different threads. There’s a way to set up your own slower polling loop (with appropriately sized sleeps) in the host thread making CUDA calls, but I can’t remember the commands off the top of my head. Hopefully someone else can help here.
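One way to build such a slower polling loop (a sketch of the idea, not necessarily the exact commands the poster had in mind) is to record an event after the kernel and query it with a sleep between polls, so the host thread spends most of its time asleep rather than spinning:

```cuda
#include <cuda_runtime.h>
#include <unistd.h>    /* usleep */

/* Waits for all previously queued work in stream 0, polling only
 * once per millisecond instead of hot-spinning. */
void waitPolitely(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaEventRecord(done, 0);    /* enqueued after prior work in stream 0 */

    /* cudaEventQuery returns cudaErrorNotReady while work is pending. */
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(1000);            /* sleep ~1 ms between polls */

    cudaEventDestroy(done);
}
```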
No, GPU execution is asynchronous… your CPU isn’t used at all after launch.
What is misleading everyone is that if you queue up a SYNCHRONOUS event in the stream, usually a memcopy device->host, THEN you’re asking the CPU to poll, and that can peg the CPU if it has to wait.
To avoid that, use asynchronous memory copies, and stream events to signal when streams are complete.
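A sketch of that suggestion (variable names are illustrative; note the host buffer must be page-locked, e.g. allocated with cudaMallocHost, for the copy to be truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *out);   /* placeholder kernel */

void streamedExample(float *d_out, float *h_out, size_t bytes)
{
    cudaStream_t s;
    cudaEvent_t done;
    cudaStreamCreate(&s);
    cudaEventCreate(&done);

    /* Queue everything asynchronously in one stream. */
    myKernel<<<128, 256, 0, s>>>(d_out);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(done, s);   /* signals when the stream is complete */

    /* ... do CPU work here; nothing above forces the CPU to spin ... */

    /* Poll without blocking: */
    if (cudaEventQuery(done) == cudaSuccess) {
        /* results already in h_out */
    }
    /* ... or block only at the point of your choosing: */
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    cudaStreamDestroy(s);
}
```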
This is in the programming guide… check the stream section, and events. There are also multiple threads here on the forum.
This is where the new CU_CTX_SCHED_YIELD option comes in. Yielding is what it sounds like: the thread politely asks if anyone else would like a turn on the CPU. If there are such threads, they get to do work. If there aren’t, the yielding thread returns to polling the GPU.
This is in contrast to the original, more selfish mode, which could indeed cause conflicts.
But this is still not the ideal option for some, who would like the blocking thread to sleep and wait for an interrupt, and not do any polling (even polite polling) at all. This would let the CPU utilization drop toward 0%.
Do you think the CU_CTX_SCHED_YIELD option affects any synchronization performed by the driver (including internal synchronization in other API calls), or only cuCtxSynchronize()?
If you want to use multiple GPUs simultaneously, you need to initialize each of them separately in its own host thread. You will get an error (or incorrect behavior) if you try to switch from one GPU to another within the same host thread.
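A minimal pthreads sketch of the one-thread-per-GPU setup (names are illustrative; error handling omitted for brevity):

```cuda
#include <cuda.h>
#include <pthread.h>

/* Each worker thread creates its own context on its own device;
 * a driver-API context is bound to the thread that created it. */
static void *gpuWorker(void *arg)
{
    int ordinal = *(int *)arg;
    CUdevice dev;
    CUcontext ctx;

    cuDeviceGet(&dev, ordinal);
    cuCtxCreate(&ctx, 0, dev);

    /* ... load module, launch kernels on this GPU only ... */

    cuCtxDestroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t threads[2];
    int ids[2] = {0, 1};

    cuInit(0);
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], NULL, gpuWorker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```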