CUDA FORTRAN optimization sensitivity

Hi All,

I have a fairly large kernel with a bunch of independent equations which can be evaluated in any order:

E01 = E01 + ...
E02 = E02 + ...
:
E26 = E26 + ...
E27 = E27 + ...

Each equation is meaty, but not ridiculously so (a matter of perspective, no doubt). All the equations are built from the same 15 sub-expressions, combined in slightly different ways, e.g.

E01 = E01 + px*cy*dz + bx*qy*dz + bx*cy*qz...
E02 = E02 + px*cz*dy + bx*qz*dy + bx*cz*qy...

If I swap, say, E06 and E07, i.e.

:
E07 = E07 + ...
E06 = E06 + ...
:

as opposed to

:
E06 = E06 + ...
E07 = E07 + ...
:

the kernel time goes down from 80 ms to 70 ms, which is pretty substantial. I’ve also found a permutation that pushes the kernel time as high as 130 ms. I would love to try all possible permutations of the equations, but that would mean 27! = 1.1E28 possibilities. Even if each compile/test took only 1 s, it would take 3.45E20 years to get through them all…slightly more time than I have available.

I’ve narrowed the cause of the “problem” down to variations in temporary-variable formation, variable reuse, and register spilling/reloading. For example, the compiler might keep bx*qy from E01 in a temporary variable and reuse it wherever it can, but this kind of optimization is heavily influenced by the order in which the equations are written. I’ve started rewriting the code to persuade the compiler (because, like a horse, it cannot be forced) to form and use particular temporaries, so that the permutations have no effect, but I’m not having any luck.
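For concreteness, the kind of rewrite I’ve been attempting looks something like the fragment below (heavily stripped down, with made-up temporary names; the real kernel has 15 sub-expressions and 27 equations):

    ! Hoist each shared product into an explicitly named temporary once...
    tpxcy = px*cy ; tbxqy = bx*qy ; tbxcy = bx*cy
    tpxcz = px*cz ; tbxqz = bx*qz ; tbxcz = bx*cz

    ! ...then build every equation from the temporaries only, so that
    ! (hopefully) there is nothing left for the compiler to re-factor,
    ! no matter how the statements are ordered.
    E01 = E01 + tpxcy*dz + tbxqy*dz + tbxcy*qz
    E02 = E02 + tpxcz*dy + tbxqz*dy + tbxcz*qy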

Does anyone have any advice about how to get around this? Is it possible to use more aggressive optimizations? I should probably add that the performance becomes progressively worse in moving from CUDA 4.2 to 5.0 to 5.5 as well as from PGI 13.5 to 13.10 to 14.3, which doesn’t bode well…

I’m more than happy to elaborate if more information is required.

Cheers,
Kyle

Hi Kyle,

It does appear to be register allocation, and I can concur that later versions of CUDA do seem to try to put more variables in registers. What does ptxinfo (-Mcuda=ptxinfo) report for the number of registers being used? Can you try limiting the number of registers (-Mcuda=maxregcount:) to see if it has an effect?
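For reference, something along these lines should do it (using pgfortran here; the file name and the 64 are just placeholders):

    pgfortran -Mcuda=ptxinfo kernels.cuf
    pgfortran -Mcuda=ptxinfo,maxregcount:64 kernels.cuf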

If you can send us the code, I can pass it along to someone for more detailed analysis.

- Mat

Hey Mat,

Thanks for the reply. Hope all is well that side.

In this particular kernel, the number of registers is 127 in both versions (E06 before or after E07), with 512 threads on a K20c, which maxes out the total number of registers. However, the number of spilled and reloaded registers increases significantly, causing the performance to decrease. From other tests I’ve done, and from looking at the SASS code, only the number of loads and spills increases, not the number of operations or registers (though the register count couldn’t rise anyway, since it’s already maxed out). It’s as if the intermediate variables start being used further apart from each other, causing the extra spills and loads and the reduced performance; or the compiler forms different intermediate variables, again resulting in more spills and loads.

I didn’t want to modify the number of registers or threads in this situation; I wanted to focus on code optimization at a constant number of registers and threads. Changing the number of threads and registers does make a difference, but again, that’s not what I want to focus on here. I’ll speak to my supervisor about sending the code through to PGI. Any other ideas in the meantime?

Cheers,
Kyle

Hi again,

Just a small addition: Am I correct in saying that a greater number of spills and loads at a constant number of registers (127 in my case) indirectly implies that more registers are required, assigned and used?

~Kyle

Hi Mat,

I’m busy working on a standalone version of one of my kernels. I’ll send it to you ASAP.

On the issue(s) of temporary variable and/or extra register assignments:
I don’t really understand the reasoning behind assigning more temporary variables/registers when doing so will increase the number of spilled registers. Or perhaps I should say I don’t see why this is a decision that the programmer has no say in. I understand why it might be good to assign them if you haven’t reached the maxregcount limit (hardware or user defined), but not if you’ve already surpassed the limit.

As a thought experiment, let’s say I have a kernel that fits into 110 registers if the compiler creates no extra temporary variables (particularly during the Fortran-to-C conversion), and let’s make the maxregcount 127. If the compiler creates more temporaries (which it seems is impossible to prevent?), the required number of FLOPs decreases by some amount (good, I suppose) and the number of registers increases (not bad, until registers start getting spilled?!). I might be wrong, but I would think it’s better to perform a few redundant FLOPs (on the order of 2 cycles each?) and avoid spilling, rather than perform fewer FLOPs (by reusing temporary/intermediate values) but more memory operations (by having to pull spilled registers back in from local memory, on the order of 100 cycles each?).
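In code terms, the trade-off I have in mind looks something like this (purely illustrative; E05, E09 and the expressions are made up to match the pattern above):

    ! (a) Recompute the product at every use: a couple of redundant
    !     multiplies, but no extra value that has to stay live across
    !     the equations.
    E01 = E01 + px*cy*dz
    E05 = E05 + px*cy*qz
    E09 = E09 + px*cy*dy

    ! (b) What the compiler seems to prefer: form the product once and
    !     reuse it. Fewer FLOPs, but tpxcy has to stay live across all
    !     three uses; once the 127 registers are exhausted it gets
    !     spilled, and every reuse becomes a reload from local memory.
    tpxcy = px*cy
    E01 = E01 + tpxcy*dz
    E05 = E05 + tpxcy*qz
    E09 = E09 + tpxcy*dy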

Please let me know if I’m missing something obvious (always possible with these evil machines).

Cheers,
Kyle

Just a note from the peanut gallery here, but you might actually want to try setting maxregcount smaller than you’d think.

My code is rather register-iffic, and I’ve found that with, say, K20x cards, if I tell PGI to use fewer registers than it would by default, I can get a big performance increase. (I think it wants 127 or something, but 72 is best for me.)

At this point, I usually do a scan over maxregcount with a test kernel for every new release or piece of hardware, just to see if something has changed, and update my build accordingly.
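Concretely, the scan is nothing fancier than a loop like this in the build script (the register values, file names, and timing step are placeholders for whatever your test kernel needs):

    for nreg in 32 48 64 72 96 128 ; do
        pgfortran -Mcuda=maxregcount:$nreg -o testkernel_$nreg testkernel.cuf
        ./testkernel_$nreg    # record the kernel time for this register cap
    done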

Matt

Hi Matt,

Thanks for the response. Glad to get other ideas and opinions.

I agree completely with you on reducing the number of registers to get better performance; I’ve seen that many times. I just feel that it’s getting the desired effect for the wrong reason, if that makes sense. Maybe I’m wanting too much control?

Doing a maxregcount scan is a bit infeasible for me (or rather, I’m a bit lazy). I have 36 kernels at the moment (with more coming soon), and each one has a different register requirement.

Cheers,
Kyle

Kyle,

Well, I have quite a few myself, but I just issued a blanket maxregcount in my make script. In the end, it didn’t seem to have any effect on most of them, but one (an expensive one) did benefit, so I kept the best number and applied it to all. That makes the scripting easy with a single for loop!

Matt