My older computer has a GTX 480 with CUDA 4 installed, which I use to do some computations. I just got a newer computer with a GTX 780. Because it was easier to keep everything the same, I installed CUDA 4 on the newer computer as well. So as far as I know the software environments on the two machines are the same.
I naively assumed that the GTX 780 would be faster than the GTX 480, since it is a newer, more powerful card. But when I test it with the same software and the same computations, it actually runs slower, by a factor of about 3. I don't have enough knowledge of CUDA or the NVIDIA hardware to even begin to guess why. Can anyone help me?
I'm also wondering whether moving to CUDA 6 on the new machine would help. I haven't done so because of compatibility issues between the compiler needed for CUDA 6 and other packages I need to use (e.g., MATLAB). But I'd try to sort that out if I knew it would solve the slowness issue.
Your new GTX 780 is a Kepler device (compute capability 3.5); the older GTX 480 is a Fermi device (compute capability 2.x).
To make older code run fast on Kepler, a number of changes are required:
a) change grid and block layouts to better suit the hardware
b) cut down on the use of shared memory
c) make use of Kepler-specific features such as warp shuffle, the __ldg() intrinsic, etc.
and read the Kepler tuning guide ;)
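To make (b) and (c) concrete, here is a minimal sketch of a warp-level sum reduction using the shuffle and read-only-cache intrinsics mentioned above. This is an illustration, not anyone's actual code: the kernel name and launch shape are made up, and it assumes compilation for sm_35 (warp shuffle needs sm_30+, __ldg() needs sm_35+) with a pre-CUDA 9 toolkit, where __shfl_down() takes no mask argument.

```cuda
// Warp-level sum of n floats, Kepler style. On Fermi you would stage
// partial sums through shared memory with __syncthreads(); here the
// lanes exchange registers directly, so no shared memory is needed.
__global__ void warpSum(const float * __restrict__ in, float *out, int n)
{
    float v = 0.0f;

    // Each of the 32 lanes strides over the input; __ldg() routes the
    // load through the read-only data cache (sm_35+).
    for (int i = threadIdx.x; i < n; i += 32)
        v += __ldg(&in[i]);

    // Tree reduction within the warp: halve the stride each step.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);

    if (threadIdx.x == 0)
        *out = v;   // lane 0 now holds the warp's total
}
```

Launched as `warpSum<<<1, 32>>>(d_in, d_out, n)` for a single warp; a real reduction would of course use many blocks and combine their partial results, but the point is that on Kepler the inner warp step costs no shared memory and no barriers.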
The 4.0 toolkit predates the sm_3x Kepler GPUs, so your compiler is producing sm_20 (Fermi) PTX, which the driver then JIT-compiles into Kepler SASS at load time. While this works, the PTX was still generated with sm_20 in mind, which is bound to be inefficient on a very different architecture. Switching to 6.5 and generating native sm_35 code will help, but as cbuchner says, you still need to retune for Kepler's significantly different bottlenecks and take advantage of features like warp shuffle and __ldg().
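For reference, generating native sm_35 code with a newer toolkit might look like the following; the file names are just placeholders, and the second `-gencode` line additionally embeds sm_35 PTX so the driver can still JIT for GPUs newer than Kepler.

```shell
# Build native Kepler SASS (sm_35) plus embedded PTX for forward JIT,
# using nvcc from a toolkit that knows about Kepler (5.x/6.x):
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35 \
     -o mykernel mykernel.cu

# Confirm which architectures actually ended up in the binary:
cuobjdump --list-elf mykernel
```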
One other thing that may be involved is your choice of OS. I am assuming your 'new' machine runs a newer OS, such as Windows 7 or 8. If your old one ran Windows XP, then that difference alone can account for a massive performance change. With Vista (and all subsequent versions of Windows), Microsoft introduced a new driver interface for graphics cards called WDDM. It gives programs using the graphics card the flexibility to over-subscribe graphics memory (similar to paging for system RAM), along with extra security features. The downside is that this doesn't fit a CUDA environment, where you are working with raw device pointers, so Windows sticking its nose into every operation such as a kernel launch carries a performance hit. The size of the hit depends on the program, but as an example, my highly iterative program runs almost exactly twice as fast on Ubuntu, without the WDDM overhead, as on Windows 7. Same code, same hardware; just a different OS meant a doubling of the speed.
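If you want to check which driver model your GPUs are running under, nvidia-smi can report it on Windows; a sketch (the query field is Windows-only, and note that GeForce cards like the GTX 780 are, as far as I know, locked to WDDM, with TCC reserved for Tesla/Quadro-class cards):

```shell
# Report each GPU's current driver model (WDDM vs. TCC) on Windows:
nvidia-smi --query-gpu=name,driver_model.current --format=csv
```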
Thanks for the advice everyone! I’ll try out those suggestions.