GTX 780 Ti integer operations (GIOPS)

Been testing this GPU in linux, and WOW it is fast. Even when I use 64 bit integers it often outperforms the K20.

For example, the GTX 780 Ti takes only 8.17 seconds to generate all 13! permutations of an array in local memory (no evaluation of the permutations, just generating them all without repetition), while the K20 takes about 12 seconds for the same task.

For 14! it takes the GTX 780ti 126.37 seconds, while the K20 takes 188 seconds.

The funny thing is that for other pieces of code, which are mostly 32-bit float operations, the GTX 780 Ti is not always faster; for some data shapes it is slower than the K20 by as much as 40%.

Is there a fundamental difference in the way the GTX 780ti handles integer operations when compared to the K20 ?

I am not aware of any fundamental differences in the integer processing units of various sm_35 GPUs, but I am not a hardware expert.

Are you sure the performance differences observed can’t be explained by a combination of different clocks, different memory bandwidth, and scheduling differences caused by the different number of SMs (e.g. when the number of thread blocks is evenly divisible by SM count)? You could use the profiler to drill down and identify key differences for a given workload. K20 is PCIe Gen2 while 780 Ti is PCIe Gen3, so that could make a difference if your system supports PCIe Gen3 and application performance has a dependency on host/device transfers.

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-780-ti/specifications
http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Active-BD-06499-001-v04.pdf

I assume you are performing the K20 vs 780Ti comparison as a controlled experiment where nothing in the system changes except that the GPUs are swapped. Otherwise there could be numerous other effects that impact application-level performance.

Why are 32-bit floating point operations slower on the GTX 780 Ti than on the K20?

How was it determined that the GTX 780 Ti is “slower with 32-bit floating-point operations than the K20”? A quick but reasonable assessment might be the comparison of calls to cublasSgemm() with large matrices, say dimensions m=n=k=8192. What throughput do you measure on K20, what on GTX 780 Ti? It appears you are running Linux, so influence from different driver models (WDDM vs TCC) should not be a factor.

As my work is focused on the professional line of Tesla GPUs I don’t have hands-on experience with the GTX 780 Ti, but the specifications suggest that it should not be slower: the GTX 780 Ti has 2880 cores at 875 MHz base clock, while the K20 runs 2496 cores at 706 MHz.

I am aware that the GTX 780 Ti uses dynamic clock boosting (up to a maximum boost clock of 928 MHz per the specification I linked in my previous post), so you may want to check what the effective clocks are when you are running benchmarks. The dynamic clocks may also play a part in the different relative performance of GTX 780 Ti vs K20 on the integer-intensive codes run by CudaaduC.

There are many differences between the machines (operating systems, CPU setups, RAM), and I am aware that there are a number of factors at work.

Basically I just ran my typical permutation code and compared the results on these two totally different machines (though both are higher-end desktops). Even went back and fine tuned the code on the K20 to be about 10% faster, which was nice.

I just wondered if there was some design change made regarding computation using 32-bit integers. The degree of difference is greater than what could be explained by the higher clock speed and increased core count.

As far as the 32-bit floating point comparison, that is for a rather complicated solver, so there could be a number of other factors at work. I never said the GTX 780 Ti was always faster; rather, I said that for some cases (shapes of data sets) in my working set of code the K20 had better times.

For 32-bit float cublasSgemm() the GTX 780 Ti seemed about 19% faster than the K20.
Have not tested 64-bit yet.

I think digging at the differences with the help of the profiler (after eliminating as many of the system differences as possible) is the best bet for figuring out the root causes of the performance differences.

When benchmarking a GTX 780 Ti or similar high-end consumer card, it is essential to always be aware of the dynamic clock boosting feature. The boost range tends to be fairly wide, and the resulting range of performance can thus be wide as well.

There is no guarantee that consecutive runs of the same application on the same GPU in the same system will run at identical clocks. Nor is it guaranteed that two different physical GPUs of the same type will run at the same clocks when plugged into the same system and running the same app.

I did not know that actually, thanks.

My results on the Tesla have always been very consistent and I have been pleased with the performance across the full spectrum of problems.

Would not mind having the 780 Ti for the WDDM video out (slot 0), and the K20 in slot 1. Then set up a dual boot and possibly use both from the Linux partition. Heck, I can just use the 780 Ti for games…

There is no dynamic clock boosting with Tesla GPUs, as this would interfere with cluster operation. However, recent Tesla cards support a manual clock boost, as users can dial in “application clocks”. These can be set with the -ac switch of nvidia-smi. Use “nvidia-smi -q -d SUPPORTED_CLOCKS” to have nvidia-smi display the supported clock settings.

This manual clock boosting allows customers to run particular applications at higher clocks if the power budget allows it. I would encourage CUDA users to give this a try with their applications. The AMBER team for example recommends it in their GPU performance tuning docs (http://ambermd.org/gpus/).