For some reason the CPU time is 2.5x the GPU time, which explains the slowdown.
This problem occurs only when some other CPU-intensive process is running in the background (at idle priority), even if I run my program at normal priority. CUDA 1.0 seemed to respect host process priorities more or less; at least there was no such degradation caused by processes running at idle :(
In CUDA 1.0, there was a lot of negative feedback about excessive CPU utilization from the busy wait in functions such as cudaThreadSynchronize(). So for CUDA 1.1, we added a thread yield if the GPU is still busy. This change dramatically improved multi-GPU scalability in our testing, without any obvious adverse performance changes. But, the symptoms you describe are consistent with that change: the 1.1 driver is yielding in its busy wait when the 1.0 driver did not.
There is a tension between yielding and not yielding. If we do not yield, the CPU is kept busy doing something that isn’t very productive; if we do yield, the CPU may context switch to lower-priority processes.
The good news is that with streams and events in 1.1, apps have the tools needed to coordinate execution with the GPU themselves. You can check whether the GPU is busy with cu(da)StreamQuery(0), for example. You could even implement your own busy wait with its own heuristics (and no thread yield). Or, find something else for the CPU to do while the GPU is working.
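As a sketch of that suggestion: the loop below polls the default stream with cudaStreamQuery(0) instead of calling cudaThreadSynchronize(), so the host thread never yields its time slice to an idle-priority background process. The helper name and the decision to spin with no pause at all are my own choices for illustration, not part of the driver's API.

```cuda
#include <cuda_runtime.h>

// Spin on the default stream (stream 0) without yielding the CPU.
// cudaStreamQuery() returns cudaErrorNotReady while previously launched
// work is still pending, and cudaSuccess once the stream has drained.
static cudaError_t spinWaitDefaultStream(void)
{
    cudaError_t err;
    while ((err = cudaStreamQuery(0)) == cudaErrorNotReady) {
        // Pure busy-wait: no thread yield, so the scheduler has no
        // opportunity to hand the slice to a lower-priority process.
        // Useful CPU work could also be interleaved here instead.
    }
    return err;  // cudaSuccess, or the first error from prior async work
}
```

Calling this right after a kernel launch reproduces the CUDA 1.0 behavior (full CPU burn, but no priority inversion); adding your own heuristics inside the loop lets you trade CPU utilization against latency as you see fit.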
Meantime, we will continue working to improve the heuristics that the driver uses to decide when and how to yield.
The problem with 1.1 is that processes running at low priority affect the performance of CUDA programs running at higher priorities.
I’ll try to work around this with the streams and events API. Thanks for the reply.