Performance drops with 177.84 driver: old 174.55 driver has better performance

My program generates lots of small computation tasks that run asynchronously, with other computations being done on the CPU at the same time. It works great with the 174.55 driver and CUDA 2.0 Beta 2. I have tried the recently released CUDA 2.0 and the 177.84 driver. Unfortunately, the CUDA part suffers a dramatic performance drop after migrating to the 177.84 driver. In fact, I noticed the same performance loss with the 177.35 driver, but I hoped the bug would be fixed in the official release. The problem is with the driver and not with the CUDA toolkit: I rolled back to 174.55, re-compiled the program with CUDA 2.0, and performance is OK.

With the 177.84 driver I have noticed a much higher level of kernel CPU usage than with the older driver.

My program is not GPU-intensive: it works fine on an 8600GT and even on an 8400GS. In fact, it does most of its computations on the CPU while using the GPU as a co-processor for certain pieces of code.

I use an nForce 570 SLI motherboard, an AMD Athlon 64 X2 3800 CPU, Windows XP SP2, and 2GB of RAM. I work with the driver API. Unfortunately, I can’t post the full code of my program here, but I can share CUDA-related code fragments.

I’m ready to answer any questions.

I noticed a similar performance drop between 177.11 and 177.35.


Thank you, Vasily. Could you advise me where I can get the 177.11 driver? I’d like to check it with my program.

You need to login at I believe you need to be a registered developer to do that.

I have found 177.26 driver here:

Installing this driver with a modified .inf file (taken from the same site) was successful. I got the same performance as with 174.55. So the performance issue appeared somewhere between versions 177.26 and 177.35.

With the 177.26 driver and an 8400GS card, the GPU version of my program runs 40% faster than the pure CPU version (and that’s a good result). With the 177.35 driver, the GPU version runs 3.5 times slower than with the 177.26 driver and 2.5 times slower than the CPU version.
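
These three figures are mutually consistent if "40% faster" is read as 1.4x the CPU version's throughput. A quick check under that assumption (the variable names are illustrative and do not come from the program being discussed):

```python
# Normalise the pure-CPU run time to 1.0 (arbitrary units).
cpu_time = 1.0

# Assumption: "40% faster" means 1.4x the CPU throughput.
gpu_time_177_26 = cpu_time / 1.4

# Reported: with 177.35 the GPU version is 3.5x slower than with 177.26.
gpu_time_177_35 = gpu_time_177_26 * 3.5

# Slowdown of the 177.35 GPU run relative to the CPU-only run.
slowdown_vs_cpu = gpu_time_177_35 / cpu_time
print(f"177.35 GPU run vs CPU: {slowdown_vs_cpu:.2f}x slower")
```

Under this reading the result matches the reported 2.5x slowdown relative to the CPU version.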

There’s definitely a bug in the newer driver, and I hope that NVIDIA developers will put some effort into finding and fixing it.

At some point while playing around with different cards and driver versions, I got the same slowdown with 177.26 as with 177.35. Now I can’t reproduce the normal performance of my program with 177.26 (uninstalling does not help), but with 174.55 it is still OK.

I’m curious what the official stance on this issue is. Any comment from NVIDIA?

Is this a common problem? I can’t check on my machine, I’ve had my GPU sent for repairs.

KonstantinT, did you try benchmarking other programs (for example projects from the SDK)?

I’m seeing a big drop (50%!!!) in host to device bandwidth from driver 174.55 to 177.67. I used to get 1.5GB/s. Now I only get 730MB/s.

I got these numbers using the bandwidthTest application in the CUDA SDK. If you have your results for this app, could you put them up? Thanks.

Could you try the same test with the -memory=pinned option? Where did you get the 177.67 driver?

I got 177.67 on the CUDA download page. I’m using RHEL4.

With pinned memory, I got 2.3GB/s host-to-device on 174.55 and again around 730MB/s on 177.67.
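
Expressed as relative drops, the pinned case is the more striking one. A small sketch of the arithmetic, taking 1 GB/s as 1000 MB/s (an assumption about how the figures above were rounded; `percent_drop` is just a helper name for illustration):

```python
def percent_drop(before_mb_s, after_mb_s):
    """Relative bandwidth loss, as a percentage of the original figure."""
    return 100.0 * (before_mb_s - after_mb_s) / before_mb_s

# Host-to-device figures reported in this thread (174.55 -> 177.67).
pageable_drop = percent_drop(1500.0, 730.0)
pinned_drop = percent_drop(2300.0, 730.0)
print(f"pageable: {pageable_drop:.0f}% drop, pinned: {pinned_drop:.0f}% drop")
```

That both memory modes land on roughly the same 730MB/s suggests a common ceiling, though these numbers alone do not show where it comes from.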

Here is a reproducer for a performance degradation with texturing that we see with the 177.67 driver.

Symptoms: a simple kernel performing tex1D, compiled with CUDA 1.1, exhibits ~100% performance degradation when run with the 177.67 driver relative to 173.14.05. Timing was determined using the CUDA profiler. Code compiled with the 2.0 release exhibits the same problem.

The kernel uses tex1D to perform linear interpolation on a 1D array of 4096 float4s. Coordinate normalisation and clamping are in effect.
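
For anyone who wants to check results against the CPU, here is a rough reference for what such a fetch computes. It is a sketch and not the reproducer itself: it assumes the linear-filtering formula from the CUDA Programming Guide (texel-centre offset of 0.5, clamp addressing) and uses full floating-point weights, whereas the hardware quantises the interpolation weight. The function name and the test data are hypothetical.

```python
import math

def tex1d_linear_clamped(table, x):
    """CPU reference for a linearly filtered 1D fetch with normalised
    coordinates and clamp addressing. `table` is a list of 4-tuples
    standing in for float4 texels; `x` is nominally in [0, 1]."""
    n = len(table)
    xb = x * n - 0.5                  # shift into texel-centre space
    i = math.floor(xb)
    alpha = xb - i                    # fractional weight (hardware quantises this)
    i0 = min(max(i, 0), n - 1)        # clamp both taps to the valid range
    i1 = min(max(i + 1, 0), n - 1)
    return tuple((1.0 - alpha) * a + alpha * b
                 for a, b in zip(table[i0], table[i1]))

# Hypothetical data: 4096 texels whose components are simple ramps.
table = [(float(k), 2.0 * k, 0.0, 1.0) for k in range(4096)]
sample = tex1d_linear_clamped(table, 0.5)
print(sample)
```

Because of the weight quantisation, GPU results should be compared against this with a tolerance and not for exact equality.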

Operating System: FC 8, 64 bit.
GPU: 8800GT 512MB
CPU: Intel E6650, Intel chipset


Device-to-device bandwidth is 5-10% lower with 2.0 versus 1.1. You should not be seeing any drop in host-to-device or device-to-host bandwidth as far as I’m aware, though.

I’ll have to look at this texturing code and see what’s up with that.

Is this a problem that NVIDIA recognizes and will fix, or should we just ‘deal with it’? Or is it a case of “feature, not a bug”? :rolleyes:

I’d also like to know the answer to this. Also, is it CUDA 2.0 or the drivers that are the problem?

Having moved from an 8800GTX to a GTX 280 with a memory clock of 1.225GHz, I was hoping to see device-to-device bandwidth in the area of 235GiB/s, but bandwidthTest reports only 54% of the theoretical max:

Device to Device Bandwidth
Transfer Size (Bytes)    Bandwidth(MB/s)
33554432                 127552.8

I hope this is just a driver bug.

The GTX 280 has a 512-bit memory interface. So at this clock rate you should get 1.225 * 512 / 8 * 2 = 157GB/s at the pins. 127GB/s is 81% of that, which looks pretty good. How did you get 235GiB/s?
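
The arithmetic behind both figures can be sketched as follows. The helper name is made up for illustration, and the measured value assumes bandwidthTest's "MB/s" means 10^6 bytes per second, which may not be exactly how the tool scales its output. A multiplier of 3 instead of 2 reproduces the 235 figure.

```python
def peak_bandwidth_gb_s(mem_clock_ghz, bus_width_bits, transfers_per_clock):
    """Theoretical pin bandwidth in GB/s (10^9 bytes per second)."""
    return mem_clock_ghz * (bus_width_bits / 8) * transfers_per_clock

# Measured device-to-device figure from the bandwidthTest output above.
measured_gb_s = 127552.8 / 1000.0   # assumption: MB/s taken as 10^6 bytes/s

ddr_peak = peak_bandwidth_gb_s(1.225, 512, 2)   # double data rate: 2 transfers per clock
x3_peak = peak_bandwidth_gb_s(1.225, 512, 3)    # a x3 multiplier gives the 235 figure

print(f"x2 peak: {ddr_peak:.1f} GB/s, measured is {100 * measured_gb_s / ddr_peak:.0f}% of it")
print(f"x3 peak: {x3_peak:.1f} GB/s, measured is {100 * measured_gb_s / x3_peak:.0f}% of it")
```

With the x2 multiplier the measurement is about 81% of peak; the 54% figure only appears when a x3 multiplier is used.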

We know device to device bandwidth is down, we know it’s lame, we’re working to improve it. I’m not going to tell you it’s going to magically jump back up to 174.xx levels, but we are working on it. Hopefully you’ll see some performance increases there in the not-too-distant future.


What about host to device speeds? It seems to be consistently slower. I’ve tested it on 3 to 4 different machines.

It’s supposed to be DDR3, which would be *3 instead of *2; but, perhaps I misunderstand how it works.

No, I think DDR means Double Data Rate. Also: