Compiler asks for acc routine information for internal function

Hello,

We are currently testing PGI Accelerator to see if it can help us improve our computation time, but we are getting this error message:

[...omitted output]

    909, Loop is parallelizable
         909, Conditional loop will be executed in scalar mode
              Loop carried dependence due to exposed use of ..inline(:,:),..inline(:) prevents parallelization
              Complex loop carried dependence of ..inline prevents parallelization
PGF90-S-0155-Procedures called in a compute region must have acc routine information: pgf90_copy_f77_argsl (abis_dll.f03: 1469)
PGF90-S-0155-Procedures called in a compute region must have acc routine information: pgf90_copy_f77_argsl (abis_dll.f03: 1458)
PGF90-S-0155-Procedures called in a compute region must have acc routine information: pgf90_copy_f77_argsl (abis_dll.f03: 1434)
PGF90-S-0155-Kernel region ignored; see -Minfo messages  (abis_dll.f03)
sea:
   1434, Accelerator restriction: call to 'pgf90_copy_f77_argsl' with no acc routine information
   1458, Accelerator restriction: call to 'pgf90_copy_f77_argsl' with no acc routine information
   1469, Accelerator restriction: call to 'pgf90_copy_f77_argsl' with no acc routine information
  0 inform,   0 warnings,   4 severes, 0 fatal for sea

Here is an extract of the code where the error occurs:

      r(1:3,n) = vectoriel(t(1:3,n),at(1:3,n)) ! <- line 1425
      u(1:3,n) = vectoriel(t(1:3,n),f(1:3,n))
      do i=n-1,1,-1
        j=i+1
        h=(s(j)-s(i))*pl(j)
        hfl=0.5d0*(s(j)-s(i))*ffl
        do k=1,3
          at(k,i)=at(k,j)+(f(k,i)+h*gr(k)+hfl*(fl(k,i)+fl(k,j)))
        end do
        r(1:3,i) = vectoriel(t(1:3,i),at(1:3,i))  ! <- line 1434
        u(1:3,i) = vectoriel(t(1:3,i),f(1:3,i))
      end do

As you can see, the instruction at line 1434 is almost the same as the one at line 1425, which does not raise an error.

Here is the vectoriel function:

      pure function vectoriel(x,y) result(v)
!$acc routine
      implicit none
      real*8, dimension(3), intent(in) :: x,y
      real*8, dimension(3) :: v
      v(1)=x(2)*y(3)-x(3)*y(2)
      v(2)=x(3)*y(1)-x(1)*y(3)
      v(3)=x(1)*y(2)-x(2)*y(1)
      end function

The file is compiled with a .f03 extension, using this command line:

FCFLAGS  = -fast -Minline,reshape -Minfo=accel -m64 -tp=x64
ACCFLAGS = -acc -ta=tesla,host
$(FC) $(FCFLAGS) $(ACCFLAGS) -c app.f03 app.$(OBJ)
$(FC) $(FCFLAGS) $(ACCFLAGS) -o app.$(EXE) app.$(OBJ)

If anyone has an idea of what the problem is, I would greatly appreciate it.

The trial is almost over and we’d like to have a working prototype to decide if we want to buy a license or not.

Thank you.

Alex.

Hi Alex,

Can you try calling the routine using “1,i” instead of “1:3,i”?

[code]r(1:3,i) = vectoriel(t(1,i),at(1,i))  ! <- line 1434[/code]

What’s happening is that since you’re passing in a sub-array, we need to call a runtime routine to determine if the array is contiguous or if we need to create a contiguous temp array to pass to the subroutine. This runtime routine has not been ported to the device, mostly due to the fact that creation of the temp array is very expensive.
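To make the distinction concrete, here is a small self-contained sketch (a hypothetical driver, not the original program): with an explicit-shape dummy argument, passing the first element associates the dummy with the element sequence, so no contiguity check or temporary array is needed.

```fortran
program seq_assoc
  implicit none
  real*8 :: t(3,2), at(3,2), r(3,2)
  t  = 1.0d0
  at = 2.0d0
  ! Array-section actual: the compiler may call a runtime routine to
  ! check contiguity and build a temporary if needed.
  r(1:3,1) = vectoriel(t(1:3,1), at(1:3,1))
  ! First-element actual (sequence association): only the base address
  ! is passed, so no runtime check or temporary is required.
  r(1:3,2) = vectoriel(t(1,2), at(1,2))
  print *, r
contains
  pure function vectoriel(x, y) result(v)
    real*8, dimension(3), intent(in) :: x, y
    real*8, dimension(3) :: v
    v(1) = x(2)*y(3) - x(3)*y(2)
    v(2) = x(3)*y(1) - x(1)*y(3)
    v(3) = x(1)*y(2) - x(2)*y(1)
  end function vectoriel
end program seq_assoc
```

Both calls are legal Fortran because the dummy arguments are explicit-shape (`dimension(3)`); only the first form requires the runtime contiguity machinery on the device.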

Though, would you mind sending a reproducing example to PGI Customer Service (trs@pgroup.com)? I’d like to see if our engineers can do something better here.

The trial is almost over and we’d like to have a working prototype to decide if we want to buy a license or not.

Customer Service can extend your license if needed.

  • Mat

Hello,

Your suggestion is working.
I’ll try to make and send a code example.

I now get an error from the nvvm compiler:

nvvmCompileProgram error: 9.
Error: app.n001.gpu (3893, 24): parse error: invalid redefinition of function 'contact_908_gpu'
pgnvd-Fatal-Could not spawn C:\Program Files\PGI\win64\15.3\bin\pgnvvm.exe
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (app.f03: 1)
  0 inform,   0 warnings,   1 severes, 0 fatal for X.sqrt

I had this one before and never found a solution other than modifying the content of the kernels block (mainly removing things from it).

Thanks.

Hello again,

The previous error occurs only with the compile flag “-tp=x64”.

Removing this flag seems to work… until the next error ;)

[... omitted lines regarding creation of .obj ...]

pgfortran -fast -Minline,reshape -Minfo=accel -m64 -acc -ta=tesla:keep -o app.exe app.obj
nvlink : error : Undefined reference to 'cudaMalloc' in 'app.obj'
nvlink : error : Undefined reference to 'cudaFree' in 'app.obj'
pgnvd-Fatal-Could not spawn C:\Program Files\PGI\win64/2015/cuda/6.0/bin\nvlink.exe
child process exit with signal 127: C:\Program Files\PGI\win64\15.3\bin\pgnvd.exe

Is there a way, or a guide, on how to avoid these nvlink and pgnvvm/pgnvd errors?

I’m playing around and I’m getting a lot of errors from them, like these:

nvvmCompileProgram error: 9.
Error: app.n001.gpu (753, 43): parse error: integer constant must have integer type
pgnvd-Fatal-Could not spawn C:\Program Files\PGI\win64\15.3\bin\pgnvvm.exe
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (app.f03: 1)
  0 inform,   0 warnings,   1 severes, 0 fatal for X.sqrt



nvvmCompileProgram error: 9.
Error: app.n001.gpu (3033, 22): parse error: invalid cast opcode for cast from 'i32' to 'double'
pgnvd-Fatal-Could not spawn C:\Program Files\PGI\win64\15.3\bin\pgnvvm.exe
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (app.f03: 1)
  0 inform,   0 warnings,   1 severes, 0 fatal for X.sqrt

Hi,

I was not able to produce a standalone piece of code that reproduces this issue. Unfortunately, I cannot disclose the original code either.


Quick question: is the C compiler better at “accelerator” optimization, or does it have the same level of quality?

I have to admit, pgfortran is a bit capricious about my code, making me change a lot of it, and I wasn’t able to produce a single binary with better performance in the 15 days of the trial (ending in a few hours).

If the C compiler has better support, we would be willing to port our code to C, since we are already required to rewrite a large part of it.

Thank you.

Hi Alex,

OpenACC support in both Fortran and C is fairly mature. Though, I think you’re hitting a few limitations and some unexpected errors.

For the “-tp=x64” issue, this is because OpenACC isn’t supported with Unified Binary. We should be catching this and issuing an error. I’ve added TPR#21565.

For the undefined references to “cudaMalloc” and “cudaFree”, are you calling these functions? If so, then you’ll need to add “-Mcuda” to your link line in order to bring in the CUDA libraries.
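As a sketch of that fix (file names are illustrative, based on the link line quoted earlier in the thread):

```shell
# Illustrative link line: adding -Mcuda brings in the CUDA runtime
# libraries that define cudaMalloc/cudaFree.
pgfortran -fast -Minfo=accel -m64 -acc -ta=tesla:keep -Mcuda -o app.exe app.obj
```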

For “parse error: integer constant must have integer type”, we’re generating bad LLVM code. Though without a reproducer I can’t tell why. Try compiling with “-ta=tesla:nollvm”.

For “parse error: invalid cast opcode for cast from ‘i32’ to ‘double’”, this is most likely the same as a known error where the LLVM code generator doesn’t yet support printing from compute regions. Try adding “-ta=tesla:nollvm”.

I wasn’t able to produce a single binary with better performance in the 15 days of the trial (ending in few hours)

It could be that you need to optimize data movement, don’t have a large enough data set, haven’t exposed enough parallelism, aren’t accessing data across the stride-1 dimension (vector), etc. Did you profile your code to determine where the time was being spent?

  • Mat

So you think we would hit the same limitations in C?


I had “allocate” and “deallocate” calls in one of the functions handled by a !$acc routine directive. I was able to replace the allocatable array with a fixed-size array in that case. The error disappeared after removing both calls.
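For readers hitting the same error, the change described above might look roughly like this (a hypothetical routine; it assumes the size is known at compile time):

```fortran
subroutine helper(n, x)
!$acc routine seq
  implicit none
  integer, intent(in) :: n
  real*8, intent(inout) :: x(n)
  real*8 :: tmp(3)            ! was: real*8, allocatable :: tmp(:)
                              ! with allocate(tmp(3)) ... deallocate(tmp)
  tmp(1:3) = x(1:3)           ! no allocation happens in device code now
  x(1:3) = 2.0d0*tmp(1:3)
end subroutine helper
```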

Yes, I’m sure there is a lot I haven’t done properly.
What I meant to say was: even without a performance gain, I wasn’t able to make a binary properly calling the GPU for a block of code with a bit of complexity, like a do loop including calls to subroutines/functions handled with the !$acc routine directive.

I have made a GPU-accelerated binary when adding a simple do loop without complex code in it, but this does not provide a performance gain since the kernel is called thousands of times.

Yes, both with a profiler and with in-code counters/timers. One subroutine in particular is called about 50 000 times in my test case. This subroutine is about 60 lines. I have tried to generate GPU code around the do loop calling it, without success. Adding some GPU code in this subroutine works, but reduces performance as the kernel is called 50 000 times.

As I’ve said before, our code is old and needs a lot of rewriting, so I’m not surprised I had difficulty generating a working example.

For now I have other things to attempt and cannot spend more time on this, but we might come back to it at a later time.

Two questions:

  • Do you provide training on this subject?
  • Or, do you do porting/development services? (we provide you the source + test case + nda, you port it)

Thank you.

Hi Alex,

So you think we would hit the same limitations in C?

The Unified Binary restriction would also apply to C.
You can allocate data from within compute regions for either language, but I wouldn’t advise it given the performance impact of having every thread allocate data.
The other two, I’m not sure since I don’t know what’s causing the problem.

I have made a GPU-accelerated binary when adding a simple do loop without complex code in it, but this does not provide a performance gain since the kernel is called thousands of times.

If there’s enough work and parallelization in the routine then the number of times it’s called doesn’t matter as much. However if it’s small, then the overhead of the kernel launch will impact the performance. Ideally in these cases, you can move the compute region higher up to expose more parallelization and/or push more work into the compute region.
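As an illustration of that advice (hypothetical names, not the original code), one common pattern is to wrap the repeated calls in a data region and make each launch cover a whole parallel loop:

```fortran
! Sketch, assuming an array a(1:n) and a per-element function "work"
! marked with "!$acc routine seq". The data region keeps "a" resident
! on the device across all nsteps iterations, so each step pays at most
! one kernel launch and no host<->device copies.
!$acc data copy(a(1:n))
do step = 1, nsteps
  !$acc parallel loop present(a)
  do i = 1, n
    a(i) = work(a(i))
  end do
end do
!$acc end data
```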

Two questions:

  • Do you provide training on this subject?
  • Or, do you do porting/development services? (we provide you the source + test case + nda, you port it)

We do not, but do partner with other companies who specialize in training and consulting.

See: External Training and Consulting Resources | PGI

For you, I’d recommend contacting Acceleware.

Hope this helps,
Mat

Thank you!

Hi,

We are currently using/testing PGI Accelerator to see if it can help us improve our computation time, but we are getting this error message:

Command exit code: 2

Command output: [nvvmCompileProgram error: 9.
Error: $\pgacc2a4KNcOrtQ405y.gpu (16, 72): parse error: integer constant must have integer type
pgnvd-Fatal-Could not spawn $\bin\pgnvvm.exe
$\mod_global.F90(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 16.3-0: compilation aborted
$\mod_global.F90: ]

In mod_global.F90 we store all the parameters that are accessed by the different subroutines.
I’m not specifying/using any -tp flag, and I already tried your “-ta=tesla:nollvm” suggestion; it didn’t solve the problem…
I would greatly appreciate any help/guidance with this; I’m rather new to parallel programming tools…
Thanks in advance,



Mila

Hi Mila,

Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com)?

I see one similar report that’s being caused by the use of a Fortran linked-list data structure, i.e. a user-defined type which contains pointers to another user-defined type. However, this error was fixed in release 15.9, so it may or may not be related.
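For reference, the kind of structure described there looks something like this (a generic sketch, not the reporter’s code):

```fortran
! A derived type whose pointer component refers to another instance of
! a user-defined type, forming a linked list -- the pattern associated
! with the similar report.
type :: node
  real*8 :: val
  type(node), pointer :: next => null()
end type node
```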

Thanks,
Mat

Hi Mat,
thanks for your quick reply!
I will send you that email in a couple of minutes.
Cheers,


Mila

21565 - OpenACC: using “-tp=x64” causes device redefinition errors

has been corrected in the current 17.9 release.

dave