Small kernels are slow...

I have a program which launches many small kernels. The CUDA profiler shows that CPU time is about 5 times larger than GPU time.
I don’t use any streams (if this makes a difference). I'm on Vista SP2.
Is there a way to make these kernels run faster? I don't think I can “connect” the kernels together, because the dimensionality is different every time.

Differing dimensionality can be handled with if statements :)
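
To make the if-statement idea concrete, here is a minimal sketch, assuming two independent tasks of different sizes that don't read each other's output. The names `fusedKernel` and `runFused` and the two toy operations are made up for illustration; they are not from the original poster's program. The grid is sized for the larger task, and each thread checks which tasks its index falls into.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical fused kernel: two independent tasks of different sizes
// share one launch, and each thread checks which tasks it belongs to.
__global__ void fusedKernel(float* a, int nA, float* b, int nB)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nA) a[i] *= 2.0f;  // task A covers indices [0, nA)
    if (i < nB) b[i] += 1.0f;  // task B covers indices [0, nB)
}

// Hypothetical host helper: runs the fused kernel on two non-empty
// host arrays and writes the results back in place.
void runFused(std::vector<float>& a, std::vector<float>& b)
{
    int nA = (int)a.size();
    int nB = (int)b.size();
    float* dA = nullptr;
    float* dB = nullptr;
    cudaMalloc(&dA, nA * sizeof(float));
    cudaMalloc(&dB, nB * sizeof(float));
    cudaMemcpy(dA, a.data(), nA * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), nB * sizeof(float), cudaMemcpyHostToDevice);

    // Size the grid for the larger of the two tasks.
    int n = nA > nB ? nA : nB;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fusedKernel<<<blocks, threads>>>(dA, nA, dB, nB);

    // Blocking copies: they wait for the kernel to finish first.
    cudaMemcpy(a.data(), dA, nA * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dB, nB * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
}

int main()
{
    std::vector<float> a(1000, 1.0f), b(300, 1.0f);
    runFused(a, b);
    printf("a[999] = %.1f, b[299] = %.1f\n", a[999], b[299]);
    return 0;
}
```

This only works as-is when the fused tasks are independent. If one task needs the other's results, you are back to separate launches or some form of grid-wide synchronisation.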

I have had a similarly bad experience with lots of kernels, but when I tried gluing them together with dirty hacks, including GPU-wide synchronisation barriers, the performance was worse (just a bit, like 5%, but still).

Currently my program launches about 400 kernels in 100 ms. Some of them are quite demanding for the GPU, others are very simple, and I guess I will have to live with that launch overhead…

Note, however, that if you launch several kernels one after another without waiting for the results on the host side, the overhead is not big. Kernel launches are asynchronous, so the host is already preparing the next kernel for launch while the previous one is running on the GPU.
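
A rough way to see this on your own machine is to time the same batch of launches twice: once with a host-side wait after every kernel, and once with a single wait at the end. The sketch below assumes a trivial made-up kernel `tiny` and a hypothetical helper `timeLaunches`; the count of 400 is just borrowed from the post above. The numbers will vary with driver, OS and GPU, so treat it as a measurement aid and not a benchmark.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately tiny kernel so that launch overhead dominates.
__global__ void tiny(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical helper: host wall time in ms for `launches` kernels.
// If syncEach is true, the host waits after every single launch.
double timeLaunches(float* d, int n, int launches, bool syncEach)
{
    cudaDeviceSynchronize();  // start from an idle GPU
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < launches; ++k) {
        tiny<<<(n + 255) / 256, 256>>>(d, n);
        if (syncEach) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();  // one final wait so all work is included
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 1024;
    const int launches = 400;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    timeLaunches(d, n, 10, false);  // warm-up, result ignored

    double withSync = timeLaunches(d, n, launches, true);
    double queued = timeLaunches(d, n, launches, false);
    printf("sync after each launch: %.3f ms\n", withSync);
    printf("single sync at the end: %.3f ms\n", queued);

    cudaFree(d);
    return 0;
}
```

Keep in mind that a blocking `cudaMemcpy` between kernels also makes the host wait, so it has the same effect as the explicit synchronisation in the first run.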