Performance degradation with CUDA 1.1 (caused by background process)

Sorry for posting several topics in a row, but I really can’t figure out what’s going on.

I have recompiled my program with CUDA 1.1 and am seeing an approx. 2.5x performance degradation :(

The CUDA profiler gives the following results:

  • for SDK 1.0 with 162.01 drivers:

    memcopy,2.528
    memcopy,2.400
    _kernel,6005.313,6018.627,1.000

  • for SDK 1.1 with 169.09 drivers:

    memcopy,2.496
    memcopy,2.944
    _kernel,6007.840,15108.581,1.000

For some reason the cputime is 2.5x the gputime (15108.581 vs. 6007.840), which explains the slowdown.

This problem occurs only if I have some other CPU-intensive process running in the background (at idle priority), even if I run my program at normal priority. CUDA 1.0 seemed to respect host process priorities more or less; at least there was no such degradation caused by processes running at idle priority :(

I’ve run into the same problem.

Can someone from NVIDIA please comment on this issue? Will this be fixed in upcoming driver releases?

In CUDA 1.0, there was a lot of negative feedback about excessive CPU utilization from the busy wait in functions such as cudaThreadSynchronize(). So for CUDA 1.1, we added a thread yield if the GPU is still busy. This change dramatically improved multi-GPU scalability in our testing, without any obvious adverse performance changes. But the symptoms you describe are consistent with that change: the 1.1 driver is yielding in its busy wait where the 1.0 driver did not.

There is a tension between yielding and not yielding. If we do not yield, the CPU is kept busy doing something that isn’t very productive; if we do yield, the CPU may context switch to lower-priority processes.

The good news is that with streams and events in 1.1, apps have the tools needed to coordinate execution with the GPU themselves. You can check whether the GPU is busy with cu(da)StreamQuery(0), for example. You could even implement your own busy wait with its own heuristics (and no thread yield). Or, find something else for the CPU to do while the GPU is working.
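A minimal sketch of such a user-level busy wait, assuming the CUDA 1.1 runtime API (the kernel name, launch configuration, and any back-off heuristic here are placeholders):

```cuda
/* Poll the stream ourselves with cudaStreamQuery() instead of calling
 * cudaThreadSynchronize(), so the wait never yields the thread. */
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* ... */ }

void launch_and_spin(float *d_data) {
    my_kernel<<<128, 256>>>(d_data);

    /* cudaErrorNotReady means work is still in flight on stream 0;
     * keep spinning rather than yielding to the scheduler. */
    while (cudaStreamQuery(0) == cudaErrorNotReady) {
        /* optional: add your own heuristic here, e.g. spin N
         * iterations before deciding to sleep or yield after all */
    }
}
```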

In the meantime, we will continue working to improve the heuristics the driver uses to decide when and how to yield.

The problem with 1.1 is that processes with low priorities affect the performance of CUDA programs running at higher priorities.
I’ll try to find a workaround for this with the streams and events API. Thanks for the reply.