WDDM on Windows 7 and kernel call overhead

Hi all,
I’m having a lot of problems running my GPU code under Windows 7. Under Linux I can easily get 45 Gflops on my Tesla (which is good for this application…), but on Windows I’m stuck at 7 Gflops.
I use a lot of small kernels (I know this is bad…), and I read that WDDM increases the latency of each kernel call (I read 40 us instead of 3 us!!).

Any clue? Is there anything I can do to speed up execution on Windows?

Thanks a lot!

Downgrade to Windows XP (32-bit or Professional 64-bit), if that is an option.

Alternatively, try using the Tesla drivers on a non-display CUDA card (I think the INF files can be hacked to work on GeForce boards).
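
As a rough sanity check on the figures in the question, here is a minimal sketch of a launch-overhead model. It assumes every kernel costs a fixed launch latency plus a fixed compute time, and that the work per kernel is the same on both systems. The 45 and 7 Gflops rates and the 3 us and 40 us latencies are the numbers quoted above; the function names, the model itself, and the "fuse 10 kernels into one" scenario are hypothetical and only for illustration.

```python
# Back-of-the-envelope model: time per kernel = launch overhead + compute time.
# Rates and latencies below are the figures quoted in the question; the model
# and the fusing scenario are assumptions of this sketch, not measurements.

def solve_kernel_profile(rate_fast, rate_slow, launch_fast, launch_slow):
    """Return (compute_time_s, flops_per_kernel) that reproduces both rates.

    Assumes rate = flops_per_kernel / (launch + compute_time) on each system,
    with identical compute time and work per kernel.
    """
    compute_time = (rate_slow * launch_slow - rate_fast * launch_fast) / (
        rate_fast - rate_slow
    )
    flops_per_kernel = rate_fast * (launch_fast + compute_time)
    return compute_time, flops_per_kernel


def effective_rate(flops_per_kernel, compute_time, launch, fused=1):
    """Flops/s when `fused` small kernels are merged into a single launch."""
    return fused * flops_per_kernel / (launch + fused * compute_time)


if __name__ == "__main__":
    t, w = solve_kernel_profile(45e9, 7e9, 3e-6, 40e-6)
    print(f"implied compute time per kernel: {t * 1e6:.2f} us")
    print(f"implied work per kernel: {w:.0f} flops")
    for k in (1, 10, 100):
        rate = effective_rate(w, t, 40e-6, fused=k)
        print(f"40 us launch, {k:3d} kernels fused: {rate / 1e9:.1f} Gflops")
```

If this model is anywhere near right, each kernel does only a few microseconds of real work, so a 40 us launch cost dominates the runtime, and anything that reduces the number of launches (or the per-launch latency, as with the driver suggestions above) should recover most of the gap.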