"Unexpected flow graph", "exposed use",

I have a bunch of questions this time.

In attempting to compile an OpenACC code, I’m getting a message telling me that the compiler failed to translate the accelerator region due to an “Unexpected flow graph”. I think I understand in broad terms what this means, but I would appreciate a more specific explanation.

The same set of compiler outputs contains repeated mentions of

"Loop carried dependence due to exposed use of [array] prevents parallelization"

My first interpretation was that multiple threads were trying to update the same array, something that could be handled with atomics in CUDA. An alternative to atomics, which I implemented in the accelerator code, was to (1) create a special array just for the accelerator region, (2) zero it out before the OpenACC kernel, (3) perform a sum reduction over the special array after the kernel, and (4) add it back to the global array. However, that returns the same error message. So what is responsible for the message?

Lastly, there are several references to “Accelerator restriction: induction variable live-out from loop: i”. Some of these line numbers point to loops where the induction variable has been declared private; this suggests I don’t understand how the private declaration works, or what a live-out variable is. There are weirder instances of this message, though: sometimes it points to subroutine calls that don’t use that induction variable (edit to add: the subroutine is being inlined; I know OpenACC doesn’t handle subprogram calls right now). What’s going on there?

Thanks for any/all the advice you can give.

“Unexpected flow graph”

This is a compiler error. Can you please send PGI Customer Service (trs@pgroup.com) a reproducing example?

“Loop carried dependence due to exposed use of [array] prevents parallelization”. So what is responsible for the message?

This means that one or more of the array's elements is being written to by more than one thread.

For your special array, have you manually privatized it (i.e., added extra dimensions for each level of parallelization)? Or is every iteration of the loop using the same elements of the array (i.e., it's a local scratch array)?

perform a sum reduction over the special array after the kernel,

Ideally, you want to use scalars for reductions so that you can utilize the “reduction” clause with a “loop” directive.

or what a live-out variable is.

A "live-out" scalar is one whose value must be stored back to memory for later use. The problem is: which thread's value gets stored back?
The obvious cases are when the variable is used on the right-hand side of an expression, or in a subroutine call, after the end of the compute region. It can also occur when the variable has static storage (i.e., it has the SAVE attribute, is used in a contained routine, is a module variable, or is passed in as an argument). Also, branching can sometimes create cases where the variable may or may not be assigned, causing it to remain "live" (i.e., its value is needed).

Hope this helps,
Mat

I’m treating it as a local scratch array (that is to say, I’m trying to give every thread its own copy of the array, then reduce all the copies of arrays), but it occurs to me that there’s much more than 64kB of memory being used if there is any significant number of threads per block (apologies for the CUDA terms, but that’s obviously the conceptual framework I’m coming from). This probably means that the compiler is shifting all those arrays back into device global memory, which will probably cause slowdowns as various threads work with non-contiguous chunks of global memory.

I think adding an extra dimension for threads – and then doing a reduction along that dimension – would work best, but how can I get OpenACC to do that? With CUDA, it’d be a cinch (pseudocode-wise, at least).

That matches what my working definition of “live-out” was, and it means the OpenACC compiler isn’t parsing my code like I’m expecting it to. Almost all of the messages I’m getting have to do with induction variables or variables that should be local to the accelerator region.

Edit: I suppose trying to diagnose the phantom “live-out” messages – those pointing to a line that didn’t involve the variable mentioned – would require seeing the code.


I’m appreciative of all the answers you’re giving. And hopefully future Google-searchers are as well.

Edit: I suppose trying to diagnose the phantom "live-out" messages – those pointing to a line that didn't involve the variable mentioned – would require seeing the code

Feel free to send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. I’ll see what I can find out.

I think adding an extra dimension for threads – and then doing a reduction along that dimension – would work best, but how can I get OpenACC to do that?

The extra dimension would correspond to the loop's iteration count, which in turn gets translated into the blocks and threads. Granted, there's not necessarily a one-to-one correspondence, so you may be wasting some memory. You can also use the "private" clause on the gang (block) or vector (thread) loop, in which case only the needed number of copies of the array is created. You save some memory but lose some explicit control.

I’m appreciative of all the answers you’re giving. And hopefully future Google-searchers are as well.

You’re welcome!

  • Mat