Dispatch Kernel Overhead (OpenCL)

I recently profiled some kernel dispatches (pseudocode):

t1 = getCpuTimeStamp();
enqueueNDRangeKernel(…)
t2 = getCpuTimeStamp();

According to the GPU timestamps, the kernel itself takes about 0.2 milliseconds to execute. However, according to the CPU timestamps, just invoking the kernel dispatch takes 0.7 milliseconds.

Is it really that expensive to do a dispatch? I kind of expected it to be super fast to enqueue.

I guess my main question is for those that have used CUDA and OpenCL, is the dispatch overhead way less in CUDA? Maybe NVIDIA has a bad driver implementation on OpenCL?

I should note my algorithm works on relatively small images, approx 500x500. But I have to process a lot of them. So it is a lot of upload, do calculation, download result. Other than doing an FFT and iFFT, there is little computation done in the kernels. Maybe these small images are not a good fit for GPU.

With the dispatch overhead, it loses to the CPU implementation.

Are you measuring on a Windows platform, by any chance?

CUDA launch overhead for null-kernels is typically around 5 to 7 microseconds in sane driver environments. The default WDDM driver model used by Windows makes it much more expensive. The CUDA driver tries to counteract this by batching kernel launches, which reduces the average launch overhead but can also increase the launch overhead as seen by a particular kernel.

I would think that NVIDIA’s OpenCL implementation shares much of the low-level “plumbing” with CUDA, so presumably it is affected by the same WDDM overhead issues and batching artifacts. That is conjecture, of course.

I don’t know OpenCL. Its kernel launches may support features not supported by CUDA that increase launch overhead. Without knowing details of your timing methodology, it is unclear how reliable the measurements reported above are.

Yes, Windows 10.

It still seems strange to me that the overhead is that bad on Windows. On Direct3D you can dispatch draw calls to the GPU relatively fast. You can do 100s if not 1000s of draw calls per rendering frame before you see API draw submission overhead. I don’t know why dispatching a few kernels would be so slow.

I’m measuring and blocking on the input data uploads, so I don’t think it is doing a deferred memory allocation.

I do not use Windows 10 but I seem to recall reading elsewhere that the overhead problem has gotten slightly worse with WDDM 2.0 used by Windows 10. If your GPU allows you to make use of the TCC driver (not sure whether that is supported for OpenCL!), I would suggest using that as it eliminates the WDDM overhead issues.

I do not know how closely the Direct3D command issue mechanism is related to CUDA’s or OpenCL’s kernel launch mechanism; other than the general technique of placing commands and data in a push buffer, they may not have much in common.

As I said, the launch batching used by the CUDA driver (and, by extension, probably OpenCL) to lower the average launch overhead with WDDM can lead to spikes in the latency for a particular kernel launch. Depending on your timing methodology, you may be picking up such peaks. To get a better idea, you may want to:

(1) use a timer with at least microsecond resolution
(2) do a warmup run before starting to measure (standard procedure for all benchmarks)
(3) use null-kernels (empty kernel that does not do anything)
(4) issue thousands of kernels back to back, tracking maximum, minimum, and average launch overhead

You may well find that the minimum overhead is much closer to the “ideal” 5-7 microsecond range I stated (e.g. 10-20 microseconds) than what you are currently measuring.
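The four steps above can be sketched as a small host-side harness. This is only a sketch under stated assumptions: `launch` is a hypothetical stand-in for whatever call enqueues your null kernel (for example a wrapper around your enqueueNDRangeKernel call), and `LaunchStats` and `measure_launch_overhead` are names made up for illustration. The resolution of `std::chrono::steady_clock` is implementation-defined, so it is worth checking that it is microsecond or better on your platform.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>

// Host-side cost per call, in microseconds.
struct LaunchStats {
    double min_us;
    double max_us;
    double avg_us;
};

// Calls `launch` `warmup` times untimed, then `iterations` times back to back,
// timing each call individually on the host.
// `launch` is a hypothetical stand-in for the call that enqueues a null kernel.
// Precondition: iterations > 0.
LaunchStats measure_launch_overhead(const std::function<void()>& launch,
                                    std::size_t warmup,
                                    std::size_t iterations) {
    using clock = std::chrono::steady_clock;  // monotonic clock

    // Warmup run: the first calls often include one-time setup costs.
    for (std::size_t i = 0; i < warmup; ++i) {
        launch();
    }

    double min_us = 0.0;
    double max_us = 0.0;
    double sum_us = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto t1 = clock::now();
        launch();
        const auto t2 = clock::now();
        const double us =
            std::chrono::duration<double, std::micro>(t2 - t1).count();
        if (i == 0) {
            min_us = us;
            max_us = us;
        } else {
            min_us = std::min(min_us, us);
            max_us = std::max(max_us, us);
        }
        sum_us += us;
    }
    return {min_us, max_us, sum_us / static_cast<double>(iterations)};
}
```

Something like `measure_launch_overhead(enqueue_null_kernel, 100, 10000)` would then report all three numbers, where `enqueue_null_kernel` is your own function. If the maximum is far above the minimum, that would be consistent with the batching spikes described above.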

In general, launch overhead is not something you can do much about as a programmer. NVIDIA is well aware of the issue of launch overhead and tries to address it (e.g. the batching already mentioned). The CUBLAS and CUFFT libraries (and possibly others) shipping with CUDA have batch interfaces to support work on large-ish sets of small data items.

Generally speaking, if you have a fast CPU, properly vectorized and threaded CPU code, and the active data set can fit into the CPU’s last-level cache, it is often not worthwhile to attempt GPU processing. Likewise, if you need low-latency (rather than high-throughput) processing, e.g. for high-frequency trading, doing the processing on the CPU may well be the best solution. GPUs are great for particular kinds of processing, but they don’t make CPUs obsolete. Instead, hybrid processing allows programmers to harness the strengths of each (GPU and CPU), which is why I usually recommend pairing high-end GPUs with high-frequency (>= 3.5 GHz) quad-core or hexa-core CPUs.

It’s pretty rare for a problem to be a good fit for both GPU and CPU.

So I don’t ‘get’ why people pack their compute nodes with hundreds of GB of RAM and dual E5 v4s, when a tiny $100 Xeon and 8 GB of RAM is enough to load up the 6-8 GPUs in that machine. As long as you have one thread per GPU, it’s enough. The big configuration is only useful when you don’t have to pay for the hardware, i.e. research scientists. In the real world we buy specific to the task! It means we can buy twice as many GPUs or more for the same money!

People who build clusters often have to support non-GPU workloads in addition to GPU-accelerated apps. There are also workloads that benefit from having a truckload of system memory, an order of magnitude larger than GPU on-board memory. So I can understand where these system builders are coming from.

If one is primarily interested in a small universe of GPU-accelerated workloads, I agree that installing dual low-frequency 24-core CPUs is not very helpful, and a waste of money. But even GPU-accelerated applications tend to contain latency-sensitive serial portions. Amdahl’s Law tells us that is something that should be addressed, and I would suggest this calls for CPUs with high single-thread performance. For that reason I suggest high-frequency quad-core or hexa-core Xeons that are more in the $350-$650 range (can you even get any Xeon for a mere $100?). Make sure they provide 40 PCIe lanes if you plan to use multiple GPUs.

8 GB of system memory is way too small for heavy lifting on scientific workloads. That is what I have on my 5-year-old low-end workstation here at home, and it is limiting. For a high-end machine one would want a sizeable DDR4-based memory subsystem. With four high-end GPUs in the system, 64 GB of system memory would seem about right.

Obviously, if all you run is a single GPU-based workload, you can custom-tailor a machine configuration to it. It sounds like this is what applies in your situation. I am curious: What is the application you are alluding to?

I was going to start a debate with you, but your points are valid and helpful for most people, so I’ll save it for another thread.