Strange behavior between Fermi and Kepler

Hello,
We have a CUDA application developed on the Fermi architecture. We recently bought a new GTX 690 (Kepler) GPU, and the application takes significantly longer to run on it. I compared the two in the Visual Profiler and found something strange on Kepler: all kernels are launched after a long delay (about 7 seconds), as you can see in the enclosed image. The top image shows the timeline on Fermi (I tested two different Fermi GPUs, with similar results). The bottom image shows Kepler, where the kernels start only after almost 7 seconds of nothing. Kernel execution itself is almost twice as fast on Kepler, but I don't understand why the kernels are launched so late. Can somebody please help me?

We are using pinned memory, and the app is compiled for sm_20 on Fermi and sm_30 on Kepler. The Fermi GPUs are running on Windows 7 Pro; the GTX 690 is running on Windows Server 2008 R2.
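To narrow down where the delay comes from, it might help to time context creation, the first kernel launch, and a second launch separately. The following is only a minimal sketch under assumptions: `dummy_kernel`, `StartupTimes`, `measure_startup`, and `print_startup_report` are hypothetical names that are not part of our application, and the kernel is a trivial stand-in. It also prints the device's compute capability next to the binary and PTX versions that `cudaFuncGetAttributes` reports for the kernel.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with one of the application's kernels.
__global__ void dummy_kernel(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = 42;
}

// Wall-clock time spent in each startup phase, in milliseconds.
struct StartupTimes {
    double context_ms;
    double first_launch_ms;
    double second_launch_ms;
    int result;  // value written by the kernel, or -1 if any CUDA call failed
};

using Clock = std::chrono::steady_clock;

static double ms_since(Clock::time_point start) {
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

// Prints the error and returns true if the call did not succeed.
static bool failed(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return false;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return true;
}

// Launches the kernel, waits for it to finish, and records the elapsed time.
static bool launch_and_wait(int *d_out, double *ms) {
    Clock::time_point start = Clock::now();
    dummy_kernel<<<1, 1>>>(d_out);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t sync_err = cudaDeviceSynchronize();
    *ms = ms_since(start);
    bool launch_failed = failed(launch_err, "kernel launch");
    bool sync_failed = failed(sync_err, "cudaDeviceSynchronize");
    return !launch_failed && !sync_failed;
}

StartupTimes measure_startup() {
    StartupTimes t = {0.0, 0.0, 0.0, -1};
    int *d_out = nullptr;
    int h_out = -1;

    // The first runtime API call forces the CUDA context to be created.
    Clock::time_point start = Clock::now();
    if (failed(cudaFree(0), "cudaFree(0)")) return t;
    t.context_ms = ms_since(start);

    if (failed(cudaMalloc(&d_out, sizeof(int)), "cudaMalloc")) return t;

    // Any one-time cost tied to the kernel shows up in the first launch only.
    bool ok = launch_and_wait(d_out, &t.first_launch_ms) &&
              launch_and_wait(d_out, &t.second_launch_ms) &&
              !failed(cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost),
                      "cudaMemcpy");
    cudaFree(d_out);
    if (ok) t.result = h_out;
    return t;
}

void print_startup_report() {
    StartupTimes t = measure_startup();
    std::printf("context creation: %.1f ms\n", t.context_ms);
    std::printf("first launch:     %.1f ms\n", t.first_launch_ms);
    std::printf("second launch:    %.1f ms\n", t.second_launch_ms);

    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (failed(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) return;
    if (failed(cudaFuncGetAttributes(&attr, dummy_kernel), "cudaFuncGetAttributes")) return;
    std::printf("device compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("kernel binaryVersion: %d, ptxVersion: %d\n",
                attr.binaryVersion, attr.ptxVersion);
}
```

Calling `print_startup_report()` as the very first thing in `main` on both machines should show whether the time goes into context creation or into the first launch. If the delay is much shorter when the same executable is run a second time, that would be consistent with some one-time driver-side work such as JIT compilation of PTX, but that is only a guess and the numbers alone do not prove it.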

[Image: Visual Profiler timelines; top: Fermi, bottom: Kepler]

Did you ever find out what the issue was?