I have a fairly large kernel with a bunch of independent equations which can be evaluated in any order:
E01 = E01 + ...
E02 = E02 + ...
  :
  :
E26 = E26 + ...
E27 = E27 + ...
Each equation is meaty, but not ridiculously so (a matter of perspective, no doubt). All the equations combine the same 15 sub-expressions in slightly different ways, e.g.
E01 = E01 + px*cy*dz + bx*qy*dz + bx*cy*qz + ...
E02 = E02 + px*cz*dy + bx*qz*dy + bx*cz*qy + ...
If I swap say E06 and E07, i.e.
  :
E07 = E07 + ...
E06 = E06 + ...
  :
as opposed to
  :
E06 = E06 + ...
E07 = E07 + ...
  :
the kernel time goes down from 80 ms to 70 ms, which is pretty substantial. I’ve also found a permutation that pushes the kernel time as high as 130 ms. I would love to try all possible permutations of the equations, but that would mean 27! = 1.1E28 possibilities. Even if each compile/test took only 1 s, it would take 3.45E20 years to get through them all… slightly more time than I have available.
I’ve narrowed the cause of the “problem” down to variations in temporary-variable formation, variable reuse, and register spilling/loading. For example, the compiler might keep bx*qy from E01 in a temporary variable and reuse it wherever it can, but this kind of optimization is heavily influenced by the order in which the equations are written. I’ve started rewriting the code to persuade the compiler (because, like a horse, it cannot be forced) to form and use certain temporaries so that the ordering no longer matters, but I’m not having any luck.
Does anyone have any advice on how to get around this? Is it possible to use more aggressive optimizations? I should probably add that performance gets progressively worse moving from CUDA 4.2 to 5.0 to 5.5, and likewise from PGI 13.5 to 13.10 to 14.3, which doesn’t bode well…
I’m more than happy to elaborate if more information is required.