Performance of Launching Kernels

Dear All

I tried to launch a cuFFT library function, and according to the profiler it takes about 139 us to launch the kernel and 19 us to compute the FFT (8192 points, single precision). This is on a GeForce 740M with an i7. I launched other kernels and they took about the same time to launch. Could I expect the same latency (139 us) if I launch the kernels from inside another kernel? In other words, if I launch a kernel from another kernel, can I expect the same latency as when launching from the host CPU, not taking into account the memory transfers between host and GPU? How can I minimize that latency?

Thanks

Luis Gonçalves

I am not sure how the 139 us were measured, or what is included in that time. An actual kernel launch with no work performed inside the kernel takes approximately 5 us on the Linux system I use on a daily basis, so that could be considered the minimum launch overhead.
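For reference, a measurement along these lines can be done with a host-side timer around a loop of empty-kernel launches. The following is a minimal sketch, not a rigorous benchmark; the kernel name and iteration count are arbitrary choices for illustration:

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void emptyKernel() {}

    int main() {
        const int N = 10000;                // arbitrary sample size
        emptyKernel<<<1, 1>>>();            // warm-up launch (context creation, etc.)
        cudaDeviceSynchronize();

        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < N; ++i) {
            emptyKernel<<<1, 1>>>();        // asynchronous enqueue only
        }
        cudaDeviceSynchronize();            // wait once, for all launches
        auto t1 = std::chrono::high_resolution_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("average launch overhead: %.2f us\n", us / N);
        return 0;
    }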

What OS are you using? If you use Windows with the WDDM driver, the measurements could be distorted by the launch batching the driver performs to mitigate the high overhead of this driver model. Also, some profiler functions can add overhead to kernel launches. For low launch overhead on Windows, you would want to use the TCC driver, which, last I checked, does not work with consumer cards.

One strategy to minimize the performance impact of launch overhead is to give each kernel as much work to do as possible, for example by using a batch interface where available.
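As an illustration of the batch-interface idea for the cuFFT case discussed above, cufftPlanMany() can set up a plan that performs many transforms per execution call, so the launch overhead is paid once per batch rather than once per FFT. A rough sketch, with an arbitrary batch size and error checking omitted:

    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int nx = 8192;      // FFT size from the original post
        const int batch = 64;     // FFTs per launch; illustrative value

        cufftComplex *data;
        cudaMalloc(&data, sizeof(cufftComplex) * nx * batch);

        // One plan that executes 'batch' contiguous FFTs per exec call.
        cufftHandle plan;
        int n[1] = { nx };
        cufftPlanMany(&plan, 1, n,
                      NULL, 1, nx,    // input: tightly packed, distance nx
                      NULL, 1, nx,    // output: same layout
                      CUFFT_C2C, batch);

        // One launch overhead covers all 64 transforms.
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }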

I would guess it is WDDM, given the times involved, as I see similar effects. This is a real frustration for me: my project involves many sequential kernel calls, each of which takes longer to launch than to execute, so my performance is limited purely by kernel launching (lots of FFT → transform → IFFT → transform → FFT). I was hoping there would be a dynamic-parallelism version of cuFFT, but that effort has gone so quiet that I have come to assume it will never happen. I know Teslas would be the best bet, but the cost difference is pretty significant.

If using Tesla GPUs with the TCC driver on Windows 7/8 is not an option, you may want to consider using Linux. I have an older dual-boot Linux / Windows XP machine, so I know from experience that the Windows XP driver and the Linux driver have the same low overhead. That said, I would not expect Windows XP to be a viable option for most people at this time.

My understanding is that the launch batching performed by the CUDA driver when running with WDDM significantly reduces the average launch overhead, but it is expensive to kick off each batch. I do not know what the average cost is, so I can’t say whether the 139 usec reported above relates to the cost of kicking off a batch of launches or the average cost of a launch. It might be worth an experiment to establish the maximum kernel launch rate (e.g. of empty kernels) when using the WDDM driver.
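One workaround sometimes suggested for the WDDM batching behavior is to force the driver to submit its queued work by querying the stream. Whether and how much this helps is driver-dependent, so treat the following as an experiment rather than a guaranteed fix:

    #include <cuda_runtime.h>

    __global__ void step() {}

    int main() {
        step<<<1, 256>>>();
        // On WDDM, launches are queued and submitted in batches; querying
        // the stream is a commonly cited way to force the pending batch to
        // be submitted right away (the effect is driver-dependent).
        cudaStreamQuery(0);
        cudaDeviceSynchronize();
        return 0;
    }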

[Later:]

I ran some quick experiments on a reasonably high-end Windows 7 machine with a 3.5 GHz Sandy Bridge CPU and a Tesla C2050, using the latest drivers. Averaging across 500 kernel launches, I see a kernel launch overhead of 3-10 usec per launch. If I add a call to cudaThreadSynchronize() after each kernel call, it takes 50-100 usec per launch on average.

By comparison, I measured the average kernel launch overhead to be 3.7 usec under RHEL Linux, and 16 usec with cudaThreadSynchronize() added. Under Windows XP64 the numbers are 3.5 usec and 14 usec. These are, as stated, averages; I did not have time to look at the distribution. I would expect fairly wide variance for individual launches, so when assessing kernel launch overhead it is best to look at a larger collection of launches rather than an individual one.
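For anyone wanting to reproduce this kind of comparison, the synchronized variant of the timing loop sketched earlier in the thread might look as follows. Note that cudaThreadSynchronize() has since been deprecated in favor of the equivalent cudaDeviceSynchronize(), which this sketch uses:

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void emptyKernel() {}

    int main() {
        const int N = 500;                 // same sample size as above
        emptyKernel<<<1, 1>>>();           // warm-up
        cudaDeviceSynchronize();

        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < N; ++i) {
            emptyKernel<<<1, 1>>>();
            cudaDeviceSynchronize();       // wait for each launch to finish
        }
        auto t1 = std::chrono::high_resolution_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("average launch + sync time: %.2f us\n", us / N);
        return 0;
    }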

“Your mileage may vary”, that is, you may see different overhead on your system depending on the performance of the CPU, the GPU(s) in the system, the driver version, and probably a bunch of other factors.

It seems that synchronization adds a relatively large amount of overhead with the WDDM driver model, so in practical terms one would want to minimize the number of synchronous / synchronizing API calls in a CUDA application. I would consider that a “best practice” anyhow.
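As a sketch of that best practice, the pattern below enqueues all copies and kernel work asynchronously into a stream and synchronizes exactly once at the end; the names and sizes are illustrative. Note that cudaMemcpyAsync() only proceeds asynchronously with respect to the host when the host buffer is pinned, hence the cudaMallocHost():

    #include <cuda_runtime.h>

    __global__ void process(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *h, *d;
        cudaMallocHost(&h, n * sizeof(float));  // pinned host memory
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Everything below is enqueued asynchronously; nothing blocks the
        // host until the single synchronization at the end.
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);          // the only synchronizing call

        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }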