2X slow-down on Tesla but under profiler the speed is restored

I have a kernel that executes in 40 ms on one C870 card of my Tesla S870, but runs in only 20 ms on the Quadro FX 5600 in the same machine. It also runs in 20 ms on an 8800 GTX in a different machine. (BTW, the 5600 and the 8800 both have X running on them.)

Furthermore, under the CUDA profiler (cudaprof): if I run with no counters enabled on the Tesla, the kernel still runs in 40 ms. However, if I turn on any one or more counters in the profiler, it runs in only 20 ms.

How can I restore the Tesla performance to what it should be?

Please post a test app which reproduces this problem.

BTW, I can see basically the same problem when I run some of the SDK projects. For example, alignedTypes is a good case - I disable all but the first call (for uint8) for simplicity.

Does this problem reproduce if X isn’t running?

Please generate and attach an nvidia-bug-report.log from your system.

No, the problem goes away if I kill X, and returns after restarting X.

Attaching log file:
nvidia_bug_report.log.txt (211 KB)

Go figure. Today, my code is running at the faster speed under all circumstances. However, alignedTypes still shows the problem described above.