Driver overhead: cputime >> gputime

I’m currently working on an algorithm that requires hundreds of small passes. I’ve found that for small datasets, more time is actually spent in the driver (the profiler’s cputime) than on the computation itself (the gputime). The passes depend heavily on each other, so their number can’t be reduced much further (it took me weeks to get from thousands down to mere hundreds). Is there a way to reduce the driver overhead? Or do the NVIDIA guys plan to optimize this?