Performance drop when changing Toolkit/SDK/driver from 4.2 → 5.0


I’ve actually changed two things:

  1. Migrated to 5.0
  2. Upgraded to Windows 8

I get roughly a 2x slowdown for some of my kernels. Should I attribute this to immature drivers for Windows 8, or is it more likely related to the 5.0 Toolkit/SDK?


Impossible to say without additional information. What GPU are you using? Can you show the code for one of the kernels that sees the drastic slowdown? When you add -Xptxas -v to the nvcc compilation, do the reported statistics differ significantly between 4.2 and 5.0?

I tried recompiling with 4.2, and the timing ended up somewhere in between the Win7/4.2 and Win8/5.0 results.

One disturbing thing I noticed is that cudaMemcpyHostToDevice now takes roughly 2x as long.

My card is a GK107 (GeForce), which supports PCIe 3.0, but it now behaves as if it were running on PCIe 2.0.
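One way to sanity-check the link speed is to time a large pinned-memory host-to-device copy with CUDA events and compare the measured bandwidth against typical PCIe 2.0 vs. 3.0 throughput. This is a minimal sketch; the 256 MB buffer size and the rough bandwidth figures in the comment are my assumptions, not from the thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;          // 256 MB transfer
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);           // pinned host memory for maximum throughput
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // warm-up copy

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // In practice PCIe 2.0 x16 tops out around ~6 GB/s, PCIe 3.0 x16 around ~12 GB/s
    printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

If the number is roughly half of what the same code reported under the older driver, that would corroborate the PCIe 2.0 suspicion.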

About drivers: the Win7 notebook developer driver refuses to install on this laptop; however, the gaming (GeForce) drivers do install without any errors being reported.

I’m guessing that since Win8/4.2 is faster than Win8/5.0 but still slower than Win7/4.2, this is more related to the drivers.


I am experiencing the same drop in performance since I switched from 295.xx drivers to 304.yy/310.zz drivers. I wonder what has changed in the newest drivers.

I also get a 2x slowdown for two of my kernels, taking into account only the computation time. I could check transfer time if needed.

My system is as follows:

  • Intel Core i7-3820 (X79 platform)
  • 2x GeForce GTX 680
  • CentOS 6.3 (64 bits)
  • CUDA Toolkit 4.2

Some info about the problematic kernels:

ptxas info : Function properties for 1st kernel
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 56 registers, 480 bytes cmem[0], 8 bytes cmem[2]

ptxas info : Function properties for 2nd kernel
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 33 registers, 480 bytes cmem[0], 8 bytes cmem[2]

Both kernels are launched using 1024 blocks and 384 threads per block. I could post the code if you need it.


What I was trying to establish is whether there is a significant difference in the compiler statistics for the kernels between CUDA 4.2 and CUDA 5.0. Additionally, you could export CUDA_PROFILE=1 to compare kernel execution time and occupancy in the profiler log file.

For example, if you see much higher register usage or spilling, significantly decreased occupancy, or increased execution time for these kernels between CUDA 4.2 and CUDA 5.0, on the exact same hardware and source code base, there would likely be a compiler issue. Otherwise it may be a driver issue.
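If the profiler log is inconvenient, kernel execution time can also be compared directly with CUDA events. A minimal sketch, where myKernel and the 1024x384 launch configuration are placeholders standing in for the kernels under test:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {      // placeholder for a kernel under test
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * 384 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    myKernel<<<1024, 384>>>(d_data);         // warm-up launch (excludes one-time costs)
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    myKernel<<<1024, 384>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);    // compare across the 4.2 and 5.0 builds
    cudaFree(d_data);
    return 0;
}
```

Running the same binary under each driver, or rebuilding under each toolkit, isolates the two variables the thread is trying to separate.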

I have very infrequent access to Kepler hardware including GTX 680, so I am not personally aware of any gotchas that may be specific to Kepler and/or Windows 8.

Given the significant slowdown you experience, I think it would be helpful if you could prepare a minimal standalone repro app that demonstrates the performance drop between CUDA 4.2 and CUDA 5.0, and file a bug via the registered developer website. Thanks.

@mlopezp + @njuffa – no idea if this is a similar issue:

I have one kernel out of a large number that runs very slowly (at ~5% of full speed!) on its first launch on Kepler but runs fine on Fermi and GT200. All subsequent launches in the same CUcontext run at full speed. I’m 99.9% sure it’s not due to caching of data or anything clueless/obvious. Other variants of the same kernel do not exhibit this problem.

I have not filed a bug yet – I will soon (sorry!).

If this sounds similar to your problem then perhaps you could inspect the runtime of the first launch vs. the second.
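A quick way to check this is to time the first and second launches of the same kernel separately; a large gap would point to one-time setup costs (e.g. module load or JIT) rather than a steady-state regression. A sketch, with suspectKernel as a stand-in for the affected kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void suspectKernel(float *data) { // stand-in for the affected kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

// Times one launch of suspectKernel with CUDA events.
static float timeLaunch(float *d_data) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    suspectKernel<<<1024, 384>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 1024 * 384 * sizeof(float));
    float first  = timeLaunch(d_data);       // may include one-time costs
    float second = timeLaunch(d_data);       // steady-state time
    printf("first: %.3f ms, second: %.3f ms\n", first, second);
    cudaFree(d_data);
    return 0;
}
```

If the first launch is an outlier only on Kepler, the problems are probably unrelated; a uniform slowdown across launches would look more like the regression described above.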

I’m on Win7/x64 + GTX 680.

@njuffa: I will do Visual Profiler runs for both Win7/4.2 and Win8/4.2 to compare runtimes and see which kernels are experiencing the slowdown (I have 12-15 kernels in my application).

I’m running two identical laptops with a GT 650M side by side. System A has Win7/4.2 and System B has Win8/4.2.

@mlopezp : I can confirm that the slower system B is running the newer 306.97 driver while the faster system A is running on 295.55 driver.

EDIT: According to System Information, my system B is running at PCIe 2.0, which is consistent with the memcpy drop I observed.

The questions about the PCIe configuration are outside my area of expertise, but the information at the following link appears relevant:

Also this:

Very informally, I have spotted around a 5-10% slowdown on some (not all) kernels with CUDA 5 (and the corresponding newer driver) on a GTX 580 on Windows 7 as well.

Looking in a bit more detail, one of my kernels has changed significantly in register usage, from 20 to 41 under CUDA 5; this changed the occupancy etc., which may well have caused this effect.

Hi njuffa,

I’ve set CUDA_PROFILE=1 and then tested my code (compiled with CUDA 4.2) using both the 295.75 and 310.14 drivers.

The performance drop in the following kernels is quite noticeable (about 2x):

  • cu_aggregation_z_t
  • cu_disaggregation_z
  • cu_aggregation_t
  • cu_disaggregation

Profiler logs:


Given that these are significant performance regressions, it would be helpful if you could file a bug for these via the registered developer website, attaching a self-contained repro app. There is a link near the top of that page:

Members of the CUDA Registered Developer Program can report issues and file bugs
Login or Join Today

If you are not a registered developer yet, no worries, it should be painless to sign up (turnaround for the confirmation is usually no more than one business day). Thank you for your help.

If it only happens on the first launch on Kepler, that sounds a lot like the driver having to JIT-recompile for the Kepler architecture. What nvcc options are used when you compile the kernel initially? Are there a lot of instructions in it compared to your other kernels?

I had a similar thought when I first ran across the problem. It’s definitely not JIT’ing from PTX, since I craft my own fatbin of cubins for all architectures – no PTX anywhere. (I wouldn’t know whether any further micro-JIT’ing is performed on Kepler SASS.) Also, in my experience, if you do provide kernels that require JIT’ing, they are properly JIT’d before the kernel launch.

Furthermore, the kernel in question is one out of many and the latest version wound up being smaller than its previous incarnation… so it’s some sort of regression.

Needless to say, I dug pretty deep looking for the problem and I’ve been lax in not filing a bug. :|

Sorry for reviving old thread.

I also upgraded, from CUDA v4.1 with driver 301.42 to CUDA v5.5 with driver version 320.57 (the one that came with the CUDA Toolkit). All of my kernels suffer a performance drop of about 10-20%.
I also tried updating to the newest driver, 331.82, but the performance drop persists.
Everything is tested on a GTX 680 with 4 GB of memory. Did anyone manage to find a solution to this problem?
If I have time, I will try different drivers in order to pinpoint the exact driver that causes the performance drop.

I have found out that the performance drop first appears with driver version 306.97.
Here is the change log for that version:
Recommended driver for Windows 8 launch.
Adds support for the new GeForce GTX 650 Ti GPU.
Updates SLI profile for Tom Clancy’s Ghost Recon Future Soldier.
Updates 3D Vision profiles for the following PC games:

  • Check vs. Mate - rated Excellent
  • Counter-Strike: Global Offensive - rated Good
  • Doom 3: BFG Edition - rated Excellent
  • English Country Tune - rated Good
  • F1 2012 - rated Good
  • Iron Brigade - rated Fair
  • Jagged Alliance: Crossfire - rated Good
  • Orcs Must Die 2! - rated Good
  • Planetside 2 - rated Not recommended
  • Prototype 2 - rated Poor
  • Sleeping Dogs - rated Good
  • Spec Ops: The Line - rated Good
  • Tiny Troopers - rated Fair
  • Torchlight 2 - rated Good
  • Transformers: Fall of Cybertron - rated Fair