reduces kernel launch latency?

CUDA profiles tell me that the average kernel launch latency on an 8800GTX is around 15 us.

I wonder whether the streaming feature in devices with compute capability 1.1 would help reduce the launch latency.

The intuition is that while the GPU is busy computing something, the DMA controller is transferring information from CPU memory to GPU memory (or vice versa). That information could include not only the data used for calculation, but also the binary program code needed to execute the next kernel.

Is my guess right?
Or does anyone have experience with this?

Thank you very much!
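
For anyone who wants to reproduce a figure like the 15 us quoted above without the profiler, here is a minimal sketch that times an empty kernel from the host. It assumes a current CUDA toolkit (`cudaDeviceSynchronize` is the modern runtime call, not what the 2.0-era toolkit provided), and the kernel name and iteration count are just illustrative. Note that it measures launch plus synchronization as seen by the CPU, which is not necessarily the same number the profiler reports.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: whatever time we measure is launch and sync overhead only.
__global__ void emptyKernel() {}

int main() {
    const int iterations = 1000;  // illustrative value, not from the thread

    // Warm-up launch so context creation and code upload are not timed.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // wait, so each launch is timed end to end
    }
    auto stop = std::chrono::steady_clock::now();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    printf("average launch + sync time: %.2f us\n", totalUs / iterations);
    return 0;
}
```

The result will vary with the GPU, driver, and operating system, so treat it as a rough comparison tool rather than an absolute measurement.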


I believe the GPU is issuing the DMA, so streams will probably not help in reducing latency.

So you mean the GPU cannot issue DMA and do calculation at the same time?

As far as I know, some PCI-e based FPGA boards can do both at the same time.

The latest CUDA 2.0 beta 2 lets you specify the GPU driver's behavior, such as having the CPU spin while waiting for results from the GPU, which can reduce the latency of getting results back.

But I couldn’t find anything that can improve the kernel invocation latency. Maybe I am missing something.

Above all, the top item on my wishlist is the ability to do CUDA programming in kernel mode. And if we could customize the GPU driver more, we could significantly reduce latency, to less than 2 us, as I have seen on a PCI-E FPGA board.
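
To illustrate the spin-wait setting mentioned above, here is a minimal sketch using the runtime API names from later toolkits, `cudaSetDeviceFlags` with `cudaDeviceScheduleSpin`. I am assuming these correspond to the driver behavior described for the 2.0 beta; the driver API exposes a similar `CU_CTX_SCHED_SPIN` context flag. The kernel is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only here so there is something to wait on.
__global__ void emptyKernel() {}

int main() {
    // Request spin-waiting. Set this before any other runtime call
    // that would initialize the device.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleSpin);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    emptyKernel<<<1, 1>>>();

    // With the spin flag, the host thread busy-waits here instead of
    // yielding or blocking, trading CPU time for lower wake-up latency.
    err = cudaDeviceSynchronize();
    printf("sync status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

As said above, this targets the time to get results back; it is not expected to change the cost of issuing the launch itself.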

It can do DMA and calculation at the same time (you can copy data to/from the GPU during calculation). But I don’t think you can do the preparation for a kernel launch during calculation. For all I know, you might then be overwriting things you are still using (you are running a kernel at that time, so it needs the data that was prepared for that kernel).

Anyway, there is no known way of doing this at this time, so you will not gain anything. If 15 usec is too much overhead, then your kernel is not doing enough work.
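
The copy/compute overlap mentioned above can be sketched roughly as follows: one stream runs a kernel on the current buffer while a second stream uploads the next buffer from pinned host memory. The kernel `scale`, the buffer names, and the sizes are made up for illustration, and whether the two operations actually overlap depends on the device. This only hides data transfer time; it does not pre-stage the next kernel launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for "the calculation".
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* hostNext = NULL;
    float* devCurrent = NULL;
    float* devNext = NULL;
    // Pinned host memory is needed for the async copy to overlap.
    cudaMallocHost((void**)&hostNext, bytes);
    cudaMalloc((void**)&devCurrent, bytes);
    cudaMalloc((void**)&devNext, bytes);
    for (int i = 0; i < n; ++i) hostNext[i] = 1.0f;
    cudaMemset(devCurrent, 0, bytes);

    cudaStream_t computeStream, copyStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&copyStream);

    // Kernel works on the current buffer in one stream...
    scale<<<(n + 255) / 256, 256, 0, computeStream>>>(devCurrent, n, 2.0f);
    // ...while the next buffer is uploaded in another stream.
    cudaMemcpyAsync(devNext, hostNext, bytes, cudaMemcpyHostToDevice,
                    copyStream);

    cudaStreamSynchronize(computeStream);
    cudaStreamSynchronize(copyStream);
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(devCurrent);
    cudaFree(devNext);
    cudaFreeHost(hostNext);
    return 0;
}
```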

Thanks for the answer.

I’m not very familiar with GPU architecture. But someone told me that the binary code (the cubin file in CUDA’s case) is DMAed to the so-called “command buffer” on the GPU, and then the CPU instructs the GPU to start computing. If the command buffer is large enough and the GPU has a smart scheduler, then from an architecture point of view it should be possible to pipeline kernel invocations with minimal latency overhead.

I really hope NVIDIA can think of this. Or tell us fundamentally why they cannot do this so I can give up earlier.

Right now I’m doing things that are really time critical.

I’m also working on a time-critical application, and I’d really appreciate it if NVIDIA could comment on the latency problem…

At least tell us if there is some hope of reducing it in the future (maybe with a later CUDA version?), or is it a lost cause because of hardware limitations?

Thanks a lot

In my opinion, using CUDA 2.0 modules and the low-level (driver) API may help. Unfortunately, module usage is not very well documented.
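
As a rough idea of what the module route looks like, here is a minimal sketch with the driver API: load a precompiled module once, look up the kernel once, then launch it repeatedly. It uses calls from current toolkits (`cuLaunchKernel` and the primary-context functions), not the exact 2.0-era launch functions. The file name `kernels.cubin` and the kernel name `emptyKernel` are hypothetical; the kernel is assumed to be declared `extern "C"` and to take no arguments. The sketch only shows the mechanics and makes no claim that launches come out faster this way.

```cuda
#include <cstdio>
#include <cuda.h>

// Abort with a message if a driver API call fails.
#define CHECK(call)                                              \
    do {                                                         \
        CUresult r = (call);                                     \
        if (r != CUDA_SUCCESS) {                                 \
            printf("%s failed with code %d\n", #call, (int)r);   \
            return 1;                                            \
        }                                                        \
    } while (0)

int main() {
    CUdevice device;
    CUcontext context;
    CUmodule module;
    CUfunction kernel;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&device, 0));
    CHECK(cuDevicePrimaryCtxRetain(&context, device));
    CHECK(cuCtxSetCurrent(context));

    // One-time setup: load the binary and resolve the kernel handle.
    CHECK(cuModuleLoad(&module, "kernels.cubin"));               // hypothetical file
    CHECK(cuModuleGetFunction(&kernel, module, "emptyKernel"));  // hypothetical name

    // Repeated launches reuse the handle: 1 block of 1 thread,
    // no shared memory, default stream, no kernel arguments.
    for (int i = 0; i < 100; ++i) {
        CHECK(cuLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL));
    }
    CHECK(cuCtxSynchronize());

    CHECK(cuModuleUnload(module));
    CHECK(cuDevicePrimaryCtxRelease(device));
    return 0;
}
```

This is host-only code, so it can be built with an ordinary C or C++ compiler and linked against the driver library (`-lcuda`).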