I am porting a large CFD code to the GPU.
My strategy is to take a subroutine (sub1), rewrite it as a "stand-alone" program, accelerate it using the PGI Accelerator model, and then turn it back into a subroutine that is called from a "dummy main program" in a loop that simulates the iterations.
Now, sub1 achieved a speedup of 117x in the stand-alone version. I declared the arrays static and used the !$acc data region directive to ensure that no data transfer occurs during the computation.
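For reference, the stand-alone version is structured roughly like this (array names, sizes, and the iteration count are placeholders, not my actual code):

```fortran
program standalone
  implicit none
  integer, parameter :: n = 1024        ! placeholder size, known at compile time
  real :: a(n,n), b(n,n)                ! static arrays
  integer :: iter

  ! ... initialize a, b on the host ...

  !$acc data region copyin(a), copy(b)  ! data stays resident on the GPU
  do iter = 1, 100                      ! simulated iterations
     !$acc region
     ! ... the 6 loops of sub1 ...
     !$acc end region
  end do
  !$acc end data region
end program standalone
```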
For the main-program version, where sub1 is a subroutine again, the code of sub1 is exactly the same. However, the arrays are declared allocatable with the !$acc mirror directive. Before entering the loop in which sub1 is called, I allocate the arrays and transfer the data to the GPU using !$acc update device.
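The main-program version is set up roughly like this (again a sketch with placeholder names; the module layout is an assumption about how the mirrored allocatables are shared):

```fortran
module arrays
  real, allocatable :: a(:,:), b(:,:)
  !$acc mirror(a, b)            ! device copies mirror the allocatables
end module arrays

program main
  use arrays
  implicit none
  integer :: iter, n
  n = 1024                      ! size only known at run time
  allocate(a(n,n), b(n,n))      ! allocates host and (mirrored) device copies
  ! ... initialize a, b on the host ...
  !$acc update device(a, b)     ! one-time transfer before the iteration loop
  do iter = 1, 100
     call sub1(n)               ! sub1 contains the same code as the stand-alone version
  end do
end program main
```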
My problem is that I now get only 45x for sub1…
I checked the compiler output information (-Minfo): there is no data transfer during the computation in this version either, so that cannot be the reason for the slowdown…
My sub1 consists of 6 loops. When I measured the time spent in each loop, I saw that the first two loops were almost identical, whereas for loop 3 there was a huge difference! (And also a slowdown for the other loops.)
The -Minfo of this loop3 shows (CC 2.0)…
… for the stand-alone version (117x):
… and for the version where sub1 is called from a main program (45x):
occupancy: 50 %
It seems weird to me that the faster loop has the lower occupancy…
Can you explain, with the provided information, why this could happen?
To measure the time of loop 3, I captured the time before and after the loop. So I think the difference in array allocation can't cause the slowdown: when the loop is entered, the arrays are already allocated on the device and the data is already there in both cases…
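The per-loop timing was done roughly like this (a sketch using system_clock; I am assuming the accelerator region is synchronous, so the kernel has finished before the second timestamp):

```fortran
integer :: t0, t1, rate
call system_clock(t0, rate)
!$acc region
! ... loop 3 ...
!$acc end region
call system_clock(t1)
print *, 'loop3 time (s):', real(t1 - t0) / real(rate)
```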
Or could it be due to this difference: at compile time the compiler does not know the size of the arrays, so it cannot generate an optimal strategy for where to place each array (registers, shared memory, constant memory, …)?
Thank you very much!!!