I am baffled by the run time difference of the same program between two machines. The program involves numerous CUDA kernel launches and a lot of data transfer between CPU and GPU.
Both computers have a 4-core i7 CPU and 16 GB of RAM.
On my MSI laptop, with its 2.6 GHz CPU and 1280-core GTX 970M GPU, execution time is about 11 minutes.
With its 3.4 GHz CPU and 1920-core GTX 1070 GPU, my rack-mount computer should be faster, yet the same program takes over an hour to execute! Can anyone suggest where to look for the bottleneck?
Are you running an identical executable on both machines, or are you recompiling the source code on each machine? If the latter, check that your build settings are identical (except for the target architecture, of course). In particular, check whether -G or other debug-related flags are being used on the slower machine; the large difference in run time suggests that might be the case.
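For reference, this is roughly what release vs. debug nvcc invocations look like (the file name app.cu is a placeholder, not taken from your setup):

```shell
# Release build: full device-code optimization (GTX 1070 is compute capability 6.1)
nvcc -O3 -arch=sm_61 app.cu -o app

# Debug build: -G disables most device-code optimization and can easily slow
# kernels down by an order of magnitude -- never benchmark a -G build
nvcc -G -g -arch=sm_61 app.cu -o app_debug
```

If you want source-line information for the profiler without the debug slowdown, -lineinfo is a much lighter-weight alternative to -G.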
If the build settings are consistent, start timing relevant portions of the application, and use the CUDA profiler to zero in on the portions responsible for most of the difference in overall run time. Sooner or later the results should give rise to working hypotheses as to the root cause of your observations. Eliminate those hypotheses one by one through additional experiments.
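As a sketch of a first pass (assuming your binary is called ./app; substitute the real name): compare total wall-clock time on both machines, then let the profiler break the GPU time down per kernel and per copy direction:

```shell
# Total wall-clock time on each machine
time ./app

# Per-kernel and per-memcpy breakdown; compare the summaries from the two
# machines side by side to see which activity accounts for the extra time
nvprof --print-gpu-summary ./app
```

If the GPU summaries turn out similar on both machines, the lost time is on the host side (CPU code, I/O, or the network), which narrows things down considerably.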
Note that the issue may have nothing to do with CUDA. Maybe the app reads a lot of data from mass storage, and instead of accessing a local hard drive on your laptop it is now pulling data from an NFS-mounted remote volume on the server while the network is being hammered by other jobs. One could come up with more wild speculation like this, given that the amount of information about the application is near zero.
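The storage theory is cheap to test. A quick probe along these lines (the scratch path /tmp/io_probe.bin is arbitrary; point it at the directory your input data actually lives in) shows read throughput and the filesystem type of the path:

```shell
# Write a 64 MB scratch file and read it back. Caveat: a freshly written
# file is served from the page cache, so this is an upper bound; for true
# disk or network numbers, probe a file larger than RAM or your real input.
dd if=/dev/zero of=/tmp/io_probe.bin bs=1M count=64 status=none
dd if=/tmp/io_probe.bin of=/dev/null bs=1M 2>&1 | grep copied

# Filesystem type of the path: "nfs" here would confirm the remote-volume theory
df -T /tmp | tail -n 1
rm -f /tmp/io_probe.bin
```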