My program generates lots of small computation tasks that run asynchronously, with other computations being done on the CPU at the same time. It works great with the 174.55 driver and CUDA 2.0 Beta 2. I have tried the recently released CUDA 2.0 and the 177.84 driver. Unfortunately, the CUDA part suffers a dramatic performance drop after migrating to the 177.84 driver. I actually noticed the same performance loss with the 177.35 driver, but I hoped this bug would be fixed in the official release. The problem is with the driver and not with the CUDA toolkit: I rolled back to 174.55, checked the program re-compiled with CUDA 2.0, and performance is OK.
With the 177.84 driver I have noticed much higher kernel-mode CPU usage than with the older driver.
My program is not GPU-intensive: it works fine on an 8600GT and even on an 8400GS. In fact, it does most of its computations on the CPU while using the GPU as a co-processor for certain pieces of code.
I use an nForce 570 SLI motherboard, an AMD Athlon 64 X2 3800 CPU, Windows XP SP2, and 2 GB RAM. I work with the driver API. Unfortunately, I can’t post the full code of my program here, but I can share CUDA-related code fragments.
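The launch pattern is roughly the following (a minimal sketch, not my actual code — the module name, kernel name, parameter layout, and sizes are placeholders; the point is many small asynchronous launches overlapped with CPU work, using the CUDA 2.0-era driver API):

```cuda
#include <cuda.h>

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    mod;
    CUfunction  fn;
    CUdeviceptr buf;
    CUstream    stream;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "kernels.cubin");          /* placeholder module */
    cuModuleGetFunction(&fn, mod, "small_task");  /* placeholder kernel */
    cuMemAlloc(&buf, 4096 * sizeof(float));
    cuStreamCreate(&stream, 0);

    /* queue a batch of small tasks without blocking the CPU */
    for (int i = 0; i < 64; ++i) {
        cuParamSeti(fn, 0, (unsigned)buf);  /* CUDA 2.0-era parameter API */
        cuParamSetSize(fn, sizeof(unsigned));
        cuFuncSetBlockShape(fn, 256, 1, 1);
        cuLaunchGridAsync(fn, 1, 1, stream);
    }

    /* ...CPU-side computation overlaps with the queued GPU work here... */

    cuStreamSynchronize(stream);  /* block only when results are needed */

    cuMemFree(buf);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```

With this pattern the CPU should spend almost no time in the driver between launches, which is why the kernel-mode CPU usage jump with 177.xx is so visible.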
Installation of this driver with a modified .inf file (taken from the same site) was successful. I got the same performance as with 174.55. So the performance issue appeared somewhere between versions 177.26 and 177.35.
With the 177.26 driver and an 8400GS card, the GPU version of my program runs 40% faster than the pure CPU version (a good result). With the 177.35 driver, the GPU version runs 3.5 times slower than with 177.26 and 2.5 times slower than the CPU version.
There’s definitely a bug in the newer driver, and I hope that NVIDIA developers will put some effort into finding and fixing it.
At some point, while experimenting with different cards and driver versions, I got the same slowdown with 177.26 as with 177.35. Now I can’t reproduce normal performance with 177.26 (uninstalling and reinstalling does not help), but with 174.55 it is still OK.
Here is a reproducer for the texturing performance degradation that we see with the 177.67 driver.
Symptoms: a simple kernel performing tex1D and compiled with CUDA 1.1 exhibits roughly a 100% performance degradation (about 2x slower) when run with the 177.67 driver relative to 173.14.05.
Timing was determined using the CUDA profiler. Code compiled with the 2.0 release exhibits the same problem.
The kernel uses tex1D to perform linear interpolation on a 1D array of 4096 float4s. Coordinate normalisation and clamping are in effect.
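A sketch of what the kernel looks like (names are placeholders; my actual reproducer is equivalent — a 1D texture over 4096 float4s with hardware linear filtering):

```cuda
#include <cuda_runtime.h>

// Reproducer sketch: 1D texture of float4, sampled with tex1D.
texture<float4, 1, cudaReadModeElementType> tex;

__global__ void interp(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Sample between texels so the hardware linear filter
        // actually interpolates (normalized coordinate in [0,1]).
        float u = (i + 0.5f) / (float)n;
        out[i] = tex1D(tex, u);
    }
}
```

On the host side the texture reference is configured with tex.normalized = 1, tex.filterMode = cudaFilterModeLinear, and tex.addressMode[0] = cudaAddressModeClamp, then bound to a cudaArray of 4096 float4s via cudaBindTextureToArray.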
Device-to-device bandwidth is 5-10% lower with 2.0 versus 1.1. You should not be seeing any drop in host-to-device or device-to-host bandwidth as far as I’m aware, though.
I’ll have to look at this texturing code and see what’s up with that.
Having moved from an 8800GTX to a GTX 280 with a memory clock of 1.225 GHz, I was hoping to see device-to-device bandwidth in the area of 235 GiB/s, but bandwidthTest reports only 54% of the theoretical max:
The GTX 280 has a 512-bit memory interface. So, at this clock rate you should get 1.225 GHz x (512/8) bytes x 2 (DDR) = 157 GB/s at the pins. 127 GB/s is 81% of that, which looks pretty good. How did you get 235 GiB/s?
We know device to device bandwidth is down, we know it’s lame, we’re working to improve it. I’m not going to tell you it’s going to magically jump back up to 174.xx levels, but we are working on it. Hopefully you’ll see some performance increases there in the not-too-distant future.