Reasons why GTX 460 is faster than GTX 480

I am facing the strange problem that my program is slower when using a GTX480 than on a GTX460.

Notably, the kernels themselves are in fact faster on the 480, as expected.

However, it seems that all the memory management around them is significantly slower.
The calls to cudaMallocArray for example take roughly twice as long on the 480.

I am using the same driver and toolkit version on both cards, and compile for sm_20 (using sm_21 makes no difference).

The same code is again even slower on a GTX690, but maybe this is because I would need sm_35 there?

Thanks in advance,

Is this a controlled experiment, i.e. are you replacing the GTX480 with the GTX460 within the same system without making any other changes?

Well, it’s three different machines and the GTX480 system is in fact a bit older than the other two. However, I would have expected to gain performance instead of losing it. The 690 machine in particular is driving me crazy. It’s a high-end machine, but the same binary is slower than on my 460.

There are a number of things which could be going on here. The host-to-device bandwidth may be different on the machines, etc.

Run the bandwidth test sample (bandwidthTest) from the SDK on all three machines, and also the cuBLAS matrix multiplication sample. cuBLAS is pretty good at determining the configuration of the machine and changing the block size etc. accordingly.

The bottom line is that the 690 is much faster than the 460 or 480, but a single piece of code cannot be used as the only metric.

Oh, and yes, changing the compute capability to 35 will make a big difference. Also, if on a 64-bit machine, make sure your properties are set correctly.

Not really. They cannot run the same binary, as they are binary incompatible (the GTX 690 is compute capability 3.0, the GTX 480 is 2.0). They only appear to run the same binary by virtue of just-in-time recompilation.
And BTW if you want to avoid JIT recompilation on the GTX 690, you need to compile for sm_30, not sm_35.

Just to state the obvious, JIT compilation happens at runtime and will increase your application’s time to completion. It is best to build a fat binary which contains machine code for each compute capability you plan to run on.
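
To check whether start-up cost differs between the machines, something like the following rough sketch could be run on each of them. It prints the compute capability of every device and times context creation and the first two kernel launches with a host timer. The file name jit_check.cu, the kernel dummyKernel and the helper msSince are made up for illustration, and the build line assumes a toolkit that still supports sm_20 (as far as I know those targets were dropped in CUDA 9). Which of the timed steps absorbs the JIT cost may depend on the toolkit version, so compare the total as well.

```cuda
/*
 * Build sketch (jit_check.cu is an assumed file name). This embeds machine
 * code for compute capability 2.0 and 3.0, plus PTX as a fallback:
 *   nvcc -std=c++11
 *        -gencode arch=compute_20,code=sm_20
 *        -gencode arch=compute_30,code=sm_30
 *        -gencode arch=compute_30,code=compute_30
 *        -o jit_check jit_check.cu
 */
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel; it exists only so the binary contains device code.
__global__ void dummyKernel(int *out) { *out = 42; }

// Milliseconds elapsed since t0, measured on the host.
static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found\n");
        return 1;
    }

    // Report what each GPU actually is, so the build targets can be matched.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    dev, prop.name, prop.major, prop.minor);
    }

    cudaSetDevice(0);

    // cudaFree(0) is a common way to force context creation.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    std::printf("Context creation: %.2f ms\n", msSince(t0));

    int *d = nullptr;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }

    // The first launch carries one-time costs; the second one should not.
    for (int i = 0; i < 2; ++i) {
        t0 = std::chrono::steady_clock::now();
        dummyKernel<<<1, 1>>>(d);
        cudaError_t err = cudaDeviceSynchronize();
        std::printf("Launch %d: %.3f ms (%s)\n",
                    i + 1, msSince(t0), cudaGetErrorString(err));
    }

    cudaFree(d);
    return 0;
}
```

Keep in mind that the driver caches JIT results on disk, so only the first run after a rebuild pays the full cost. Setting the environment variable CUDA_CACHE_DISABLE=1 should make the effect repeatable.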

It’s not clear how much of your application’s overall runtime is due to code running on the host. If it is a significant portion, the performance of your host system will factor into application runtime. Likewise, the CUDA driver executes code on the host, and the higher the single-thread performance of the host system, the less host-side driver overhead there will be.

In terms of software configuration, make sure you run identical and recent driver versions on all platforms. Note that the driver model has implications for driver overhead: Linux, WindowsXP, and Windows7 TCC drivers have small overhead. Windows7 WDDM drivers have significant overhead (which the CUDA driver tries to mitigate by batching etc., but this can cause other performance artifacts).
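
To compare host-side driver overhead across the machines, a small timing loop around the allocation calls mentioned in the original question may help. This is only a sketch: the array size (1024 x 1024 floats) and the repetition count are arbitrary choices of mine, not taken from the original program.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force context creation first so it is not counted as allocation time.
    if (cudaFree(0) != cudaSuccess) {
        std::printf("Could not initialize CUDA\n");
        return 1;
    }

    // Assumed test shape: a 2D array of 1024 x 1024 floats.
    const size_t width = 1024, height = 1024;
    const int reps = 100;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    using clk = std::chrono::steady_clock;
    double allocMs = 0.0, freeMs = 0.0;

    for (int i = 0; i < reps; ++i) {
        cudaArray_t arr = nullptr;

        auto t0 = clk::now();
        cudaError_t err = cudaMallocArray(&arr, &desc, width, height);
        auto t1 = clk::now();
        if (err != cudaSuccess) {
            std::printf("cudaMallocArray failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        cudaFreeArray(arr);
        auto t2 = clk::now();

        allocMs += std::chrono::duration<double, std::milli>(t1 - t0).count();
        freeMs  += std::chrono::duration<double, std::milli>(t2 - t1).count();
    }

    // Averages of host-side wall-clock time per call.
    std::printf("cudaMallocArray: %.3f ms per call\n", allocMs / reps);
    std::printf("cudaFreeArray:   %.3f ms per call\n", freeMs / reps);
    return 0;
}
```

If these numbers already differ by a factor of two between the machines, the difference is more likely in the host CPU, OS, or driver model than in the GPU itself.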

In terms of data transfer to and from the device, ensure the PCIe interface is configured correctly. Your GPUs should run with a x16 PCIe gen2 interface, which, when properly configured, should give you a transfer rate of around 6 GB/sec in each direction for large blocks (say, 16MB). If any of your machines is a multi-socket system, make sure to tightly control NUMA features such as CPU affinity and memory affinity so the GPU “talks” to the “near” CPU and the “near” system memory.
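
If the SDK samples are not at hand, the transfer rate can be approximated with a few lines. The following is a minimal sketch, assuming pinned host memory and a 16 MB block as suggested above. The repetition count and the CHECK macro are my own additions. It does not control NUMA placement, so on a multi-socket machine it would need to be run under an external affinity tool.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the error and bail out of main().
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::printf("CUDA error: %s\n", cudaGetErrorString(err_)); \
            return 1;                                                  \
        }                                                              \
    } while (0)

int main() {
    const size_t bytes = 16u << 20;  // 16 MB block
    const int reps = 10;             // arbitrary repetition count

    void *hostBuf = nullptr;
    void *devBuf = nullptr;
    CHECK(cudaMallocHost(&hostBuf, bytes));  // pinned (page-locked) host memory
    CHECK(cudaMalloc(&devBuf, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time setup costs are not measured.
    CHECK(cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice));

    float ms = 0.0f;

    // Host to device.
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("Host to device: %.2f GB/s\n",
                (double)bytes * reps / (ms * 1e-3) / 1e9);

    // Device to host.
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("Device to host: %.2f GB/s\n",
                (double)bytes * reps / (ms * 1e-3) / 1e9);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

A result far below 6 GB/s in either direction would point at the PCIe configuration (link width or generation, slot choice) rather than at the GPU.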