In my 3-D finite-difference (FD) code, I previously had some copyin clauses like this:
This was taking quite some time, so I decided to switch to mirrored allocations (of u0, u1, and alpha) plus update clauses to synchronize host and device. Now my code looks like this:
I noticed a substantial improvement in performance. So the question is: is it a better idea to limit the initial copyin/copyout clauses on the data region in favor of mirrored allocations and update statements to synchronize the CPU and GPU?
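For reference, the two patterns being contrasted look roughly like this. This is only a sketch: the array names u0, u1, and alpha come from the question, while the bounds, loop structure, and stencil body are illustrative placeholders, since the original snippets are not shown.

```fortran
! Version 1: transfers tied to the data region boundaries
!$acc data region copyin(alpha), copy(u0(1:nx,1:ny,1:nz), u1(1:nx,1:ny,1:nz))
      do step = 1, nsteps
!$acc region
         ! ... 3-D FD stencil updating u1 from u0 ...
!$acc end region
      end do
!$acc end data region

! Version 2: device-resident (mirrored) arrays with explicit synchronization.
! "mirror" is a declarative directive on the allocatable arrays; the device
! copies are allocated when the host arrays are, and all traffic is explicit:
!$acc mirror(u0, u1, alpha)
      ...
!$acc update device(u0, u1, alpha)
      do step = 1, nsteps
!$acc region
         ! ... same stencil ...
!$acc end region
      end do
!$acc update host(u0, u1)
```

With the second form, nothing moves across the PCIe bus except where an update directive says so, which is why any extra transfers hidden in the first form show up as a timing difference.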
The “copyin” clause causes the variables to be copied to the device at that point in the program. With “mirror”, it’s entirely up to the user to copy data via “update” clauses. There is some overhead in allocating the data on the device, but overall there shouldn’t be much performance difference between the two if they are structured the same. In this case, the two versions don’t appear to be equivalent, and it’s these differences that are causing the increased performance.
What you can do is profile the code to see where the time difference comes from. Set the environment variable “CUDA_PROFILE=1” before you run your code. This will create a CUDA profile file with the timings for every device call.
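For example, from the shell (the executable name below is hypothetical):

```shell
# Turn on the CUDA driver's built-in profiler for this shell session.
export CUDA_PROFILE=1
echo "CUDA_PROFILE=$CUDA_PROFILE"

# Then run the accelerator binary as usual, e.g.:
#   ./fd3d.out
# The driver writes per-call timings (kernel launches, memcpys) to
# cuda_profile_0.log in the working directory, one line per call.
```

Comparing the memcpy lines between the two runs will show exactly which transfers the mirrored version avoids.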
Alternatively, you can use the PGI “pgcollect” utility to get a combined host and device profile. However, pgcollect will aggregate the timings, while the CUDA profiler will list out every call.