Compilation problems for loop parallelization

Hello,

I would like to improve the performance of my code by parallelizing a loop. The loop’s input data are common to all iterations, but the computations in each iteration are clearly independent. I think it is well suited to parallelization, although the loop calls a function which itself calls a chain of other subroutines and functions; automatic inlining should make it parallelizable (at least to some extent, see below). The output of the loop is a 1D array of ‘nf’ complex numbers (‘nf’ being the number of loop iterations), which I finally want to export to a text file (separating the real and imaginary parts of each complex number) for further processing with other software.

Here is the part of the code containing the loop I intend to parallelize:

273	!call acc_init(acc_device_nvidia)
274	!$acc region
275	!$acc do kernel, parallel, independent, private(k)
276	do k=1,nf
277		GxxV(k)=Gxxf(k,nl,d,c0,pi,fmin,df,eps,mu,sigma,alpha,dz,rho,theta,eta,zeta)
278	enddo
279	!$acc end region
280	open(unit=1,file='plmexjxFGPU64.out',status='replace')
281	do l=1,nf
282		write(1,*) real(GxxV(l)),imag(GxxV(l))
283	enddo
284	close(1)

The compiler command I used:

pgfortran plmexjx.f90 zbsubs.f machine.f90 -ta=nvidia -Minfo=accel -Minline,reshape -Mipa=inline,reshape

The compiler messages I got from it (Minfo=accel messages only):

plmexjx.f90:
zbsubs.f:
machine.f90:
plmexjx.f90:
PGF90-W-0155-Accelerator region ignored; see -Minfo messages  (plmexjx.f90: 274)
plmexjx:
    274, Accelerator region ignored
    276, Accelerator restriction: function/procedure calls are not supported
         277, Accelerator restriction: function/procedure calls are not supported
  0 inform,   1 warnings,   0 severes, 0 fatal for plmexjx
zbsubs.f:
machine.f90:
IPA: no IPA optimizations for 1 source files
IPA: Recompiling plmexjx.obj: new IPA information
plmexjx:
    274, Accelerator region ignored
    276, Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
                   277, Accelerator restriction: function/procedure calls are not supported
                   277, Accelerator restriction: : array accessed with too many dimensions, possibly due to inlining: ..inline
                        Accelerator restriction: : array accessed with too many dimensions, possibly due to inlining: real(..inline)
                        Accelerator restriction: : array accessed with too many dimensions, possibly due to inlining: imag(..inline)
                        Accelerator restriction: size of the GPU copy of '..inline' is unknown
                        Accelerator restriction: array accessed with too many dimensions, possibly due to inlining
                        Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
                        Accelerator restriction: one or more arrays have unknown size
              277, Accelerator restriction: size of the GPU copy of '..inline' is unknown
                   Accelerator restriction: loop has multiple exits
                   Accelerator restriction: one or more arrays have unknown size
                   Accelerator restriction: array accessed with too many dimensions, possibly due to inlining
  0 inform,   1 warnings,   0 severes, 0 fatal for plmexjx
IPA: Recompiling zbsubs.obj: new IPA information

I think several problems occurred during compilation, but I have difficulty determining their origin:

  1. Apparently some inlining problems remain, but I cannot locate them because the list of ‘-Minfo=inline’ messages is so long that I cannot scroll back to the first messages in the compiler command window. Is there a way to export the compiler messages to a text file…?
    For this, I just saw your reply to my previous post. I am using the command line shell of PGI Workstation 12.4(64) on Windows. I tried what you suggested but did not succeed; maybe I did not get the idea… I typed:
> pgfortran plmexjx.f90 zbsubs.f machine.f90 -ta=nvidia -Minfo=accel -Minline,reshape -Mipa=inline,reshape

and then:

> & logfile.txt

Is it correct?

I suspect these inlining problems come from the GO TO statements in the last procedure called in the chain, which would cause the ‘loop has multiple exits’ message (?). But this seems strange to me, because these GO TO statements only jump within the procedure (they do not exit from it), and sequentially in the loop cycles

  2. Due to inlining, I apparently also have problems with the output variable of the loop (GxxV), notably when I want to write the real and imaginary parts to the output text file. I tried to solve this by including the ‘independent’ clause, but without success. What causes it?

  3. One or more arrays have unknown size??? Yet all array sizes have been explicitly allocated at variable declaration, before the loop

  4. Array accessed with too many dimensions, possibly due to inlining… What does it mean, and how can I solve it?

Thanks a lot in advance for your help.
Fred

Hi Fred,

Is it correct?

If you are using the CYGWIN bash shell that ships with PGI Workstation on Windows, then yes. For DOS, I believe the command to redirect stderr is “command 2> logfilename.txt”, though I’m not a DOS user.
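
As a rough sketch of the bash redirection, assuming the compiler writes its -Minfo messages to stderr: the real pgfortran command is replaced here by a hypothetical stand-in function, fake_compiler, and the output file names are only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the compiler (not a PGI tool): it prints one
# line to stdout and one diagnostic line to stderr.
fake_compiler() {
    echo "plmexjx.f90:"
    echo "274, Accelerator region ignored" >&2
}

# Redirecting stdout only: the stderr line is NOT captured in the file
# (it still appears on the terminal).
fake_compiler > stdout_only.txt

# Redirecting stdout to the file, then stderr to the same place:
# both lines end up in logfile.txt.
fake_compiler > logfile.txt 2>&1

# Same capture, but the messages are also shown on the terminal.
fake_compiler 2>&1 | tee logfile_tee.txt
```

With the real compiler, the redirection would go at the end of the same command line as pgfortran and its arguments, not on a separate line.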

I suspect these inlining problems come from the GO TO statements in the last procedure called in the chain, which would cause the ‘loop has multiple exits’ message (?). But this seems strange to me, because these GO TO statements only jump within the procedure (they do not exit from it), and sequentially in the loop cycles

Without seeing the code I can’t be sure, but yes, a GOTO would typically prevent a loop from being accelerated. However, my guess is that these messages pertain to the inner loop and would not prevent the outer loop from being parallelized.

  2. Due to inlining, I apparently also have problems with the output variable of the loop (GxxV), notably when I want to write the real and imaginary parts to the output text file. I tried to solve this by including the ‘independent’ clause, but without success. What causes it?

Using ‘independent’ is correct; however, the unknown size of “..inline” is most likely what is preventing the outer loop from being accelerated.

  3. One or more arrays have unknown size??? Yet all array sizes have been explicitly allocated at variable declaration, before the loop
  4. Array accessed with too many dimensions, possibly due to inlining… What does it mean, and how can I solve it?

The “..inline” arrays are compiler-generated temporary arrays, a by-product of how inlining currently works. Unfortunately, this is a known limitation of using inlined routines with local arrays within accelerator regions. We are looking at fixes, but the fix will require some fundamental changes in how inlining is performed. This work is currently planned for later this year, once OpenACC development is completed. The current work-around is to manually inline these routines and then use the “private” clause to privatize these arrays.

Note, you are welcome to send your code to PGI Customer Service (trs@pgroup.com) and we will then use it as a test case.

Best Regards,
Mat

Dear Mat,

As you kindly proposed, I have sent the codes by email to PGI Customer Service, with explanations and a description of the problems. Could you please confirm that my email was received?
Please, note that these codes should remain strictly confidential.

Regarding exporting the compiler messages from the command window to a text file, I still have problems… I used CYGWIN and the file is now created, but it is empty. Maybe I missed something in the command lines?? I used:

> pgfortran plmexjx.f90 zbsubs.f machine.f90 -c -ta=nvidia -Minfo=inline -Minline,reshape -Mipa=inline,reshape

> & logfile.txt

Thank you very much again for your valuable help!!

Best regards,
Fred

Hi Fred,

Please, note that these codes should remain strictly confidential.

Understood.

Note that I’m in Boston attending a conference, so I most likely won’t be able to work on this until next week. Sorry for the delay.

- Mat

Dear Mat,

OK no problem, I understand. We can wait for some extra days :-)

Note that I have sent a new email to your Customer Service with the original version of the code. In the version I sent yesterday, I had removed the call to zbsubs.f in func intexjx in order to test whether the compilation problems might be due to this procedure (the GO TO statements, notably), and I forgot to undo this modification before sending it to you.
So now you have the correct code. Please excuse my clumsiness.

Best regards,
Fred

Dear Mat,

After I provided the codes and a description of the issues we encountered, PGI support informed me that the problem has been logged as TPR 18676.
Would you have any idea whether it will be possible to work around these problems, and how long it might take?

On my side, I cannot make any further tests as my PGI trial licence expired and, also, I do not really have any idea for other tests in trying to circumvent these issues. But any suggestion is welcome! :-)

Many thanks for your consideration.

Kind regards,
Fred

Would you have any idea whether it will be possible to work around these problems, and how long it might take?

Basically, you need to manually inline your routines. So you should be able to, but it would take some effort.

Note that NVIDIA has announced some new features in their next-generation processor (Kepler Tesla K20) and CUDA 5.0 (both are expected to be available in Q4 2012) which will enable us to support true calling on the GPU and eliminate the need for inlining. Granted, we have work to do here as well, and we need to extend OpenACC and the PGI Accelerator Model to express this, so I don’t expect our support to be available until a few months after the K20 is available. Then again, calling is our number one most requested feature and one of the few major inhibitors of adoption by users, so it is a high-priority item for us.

I cannot make any further tests as my PGI trial licence expired

That’s fixable. Send a note to PGI Sales (sales@pgroup.com) or PGI Customer Support (trs@pgroup.com) asking that your trial period be reset.

I do not really have any idea for other tests in trying to circumvent these issues

If all your code is like this, and you are unwilling or unable to restructure the code to be better suited to today’s GPUs, then you may want to wait for the Tesla K20.

Though if you do want to get started now in order to gain experience, you can try the example codes that accompany the compilers or some larger examples found in PGinsider Newsletter (Technical Articles and Publications | PGI).

Note that there are many other reasons to purchase the PGI compilers besides GPU directives. Our x86 compilers are able to produce very high performance binaries on both new and older generation Intel and AMD architectures. Our profiling and debugging tools have excellent interfaces and are MPI and OpenMP enabled. We developed CUDA Fortran for those users wanting a more “hands-on” approach to GPU programming. We’ve even started targeting ARM processors with OpenCL (granted, I assume you are not running your code on cell phones). So don’t let this one issue stop you.

- Mat

Dear Mat,

Thanks for this information and advice.

It is very good news that the next generation of NVIDIA processors and the next CUDA version will support true calling; we are looking forward to that.
In the meantime, I will extend the trial period and try to inline the code manually as you suggested. Nevertheless, given the complexity of our code and the fact that inlining with your compiler seems to work to some extent, I would like to know if it would be possible to access the inlined code generated by the compiler (even if it is incomplete and still contains some errors that I would try to fix myself)? If possible, this would save me from having to inline all the code manually.

Best regards,
Fred

I would like to know if it would be possible to access the inlined code generated by the compiler

Unfortunately, no. I asked our compiler team about doing this a few years ago. The problem is that by the time inlining is performed, the code has already been translated into ILM, an intermediate representation, and we don’t have a method to translate the ILM back into a particular language.

- Mat