I just installed a GTX 780 and ran my application on it.
I was surprised to see that performance on the GTX 780 was no better than on the GTX 480 I have in another computer.
I tried adding -arch=sm_35, but it didn't help.
Does anyone have ideas on how I can take better advantage of my new card?
Some random things that come to mind:
-read the Kepler Tuning Guide
-reduce your shared memory use (warp shuffle instructions can replace shared memory in some cases)
-make sure the large SMX units (192 CUDA cores each) have adequate occupancy
-run a profiling tool such as the Compute Profiler
-introduce some instruction-level parallelism to make best use of the dual-issue feature in the warp schedulers
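To illustrate the warp shuffle point above, here is a minimal sketch of a sum reduction that exchanges values between registers instead of going through shared memory. The kernel and helper names (warpReduceSum, sumKernel, sumOnDevice) are made up for this example, and it assumes a toolkit that provides __shfl_down_sync (CUDA 9 or later; older toolkits from the Kepler era used __shfl_down without the mask argument) and a block size that is a multiple of 32.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sum the values held by the 32 lanes of a warp using register shuffles.
// After the loop, lane 0 holds the total for the warp.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Hypothetical kernel: each thread accumulates a grid-stride slice, each warp
// reduces with shuffles, and lane 0 of every warp adds to the global total.
__global__ void sumKernel(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    v = warpReduceSum(v);
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, v);
}

// Host helper (illustrative only): copy the input over, run the kernel,
// and return the sum. Error checking is omitted for brevity.
float sumOnDevice(const std::vector<float>& h) {
    const int n = static_cast<int>(h.size());
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));
    sumKernel<<<64, 256>>>(dIn, dOut, n);  // 256 is a multiple of the warp size
    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main() {
    std::vector<float> ones(1 << 20, 1.0f);
    printf("sum = %.0f\n", sumOnDevice(ones));
    return 0;
}
```

Whether this beats a shared-memory reduction in a given application is something to confirm with the profiler rather than assume.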
cbuchner1 gave a lot of good generic advice. It is not clear how different the two computers are, so it seems possible that app performance is influenced or even dominated by the host system. You could try an apples-to-apples comparison by running both cards in the same computer. What are the basic properties of the application, and what is it bottlenecked by?
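One way to separate GPU time from host-side effects is to time the kernel alone with CUDA events and compare that number across the two cards. The sketch below assumes a placeholder kernel (busyKernel) and helper (timeKernelMs) that stand in for the real application code; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the application's real work.
__global__ void busyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the time in milliseconds the GPU spent on one kernel launch,
// as measured by events recorded in the default stream.
float timeKernelMs(int n) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main() {
    printf("kernel time: %.3f ms\n", timeKernelMs(1 << 24));
    return 0;
}
```

If the kernel time is clearly lower on the newer card but the total run time is not, the bottleneck is more likely in host code or transfers than in the kernel itself.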
Echoing cbuchner1's advice (and knowing no details about your application), I would suggest that the very first thing you try is increasing the thread block size. The optimal block and grid configuration on Kepler cards is very different from that on Fermi cards.
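As a starting point for the block size experiment, the runtime can suggest a block size that maximizes theoretical occupancy for a given kernel on the installed device. This sketch assumes a toolkit that includes cudaOccupancyMaxPotentialBlockSize (added in CUDA 6.5, so newer than the toolkit available when the GTX 780 launched); scaleKernel and suggestedBlockSize are illustrative names, not part of the original application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; substitute the application's own kernel here.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Ask the runtime for the block size that maximizes theoretical occupancy
// for scaleKernel on the current device (no dynamic shared memory, no limit).
int suggestedBlockSize() {
    int minGridSize = 0;
    int blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
    return blockSize;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    const int block = suggestedBlockSize();
    const int grid = (n + block - 1) / block;  // enough blocks to cover n
    scaleKernel<<<grid, block>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    printf("suggested block size: %d, grid size: %d\n", block, grid);
    cudaFree(d);
    return 0;
}
```

The suggestion only reflects theoretical occupancy, so it is still worth timing a few neighboring block sizes (for example 128, 256, 512) on each card.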
Thanks a lot, I'll try these and update here…