Accelerator region ignored; no parallel kernels found

Hi,
I have just started using the PVF Accelerator compiler. I wrote a simple program and tried to compile it, but I get the following warning:

warning W0155 : Accelerator region ignored; no parallel kernels found

The program is…

PROGRAM MAIN
USE ACCEL_LIB
DIMENSION A(100000000),B(100000000),C(100000000)

!PROGRAM TO ADD TWO MATRICES USING THE ACCELERATOR
OPEN(UNIT=100,FILE='INPUT.DAT',FORM='FORMATTED',ACCESS='SEQUENTIAL')
OPEN(UNIT=200,FILE='OUTPUT.DAT',FORM='FORMATTED',ACCESS='SEQUENTIAL')

READ(100,'(I12)') N

DO I = 1,N
  A(I) = I
  B(I) = 2*I
ENDDO

!$ACC REGION COPYIN(A,B) COPYOUT(C)
DO I = 1,N
  C(I) = A(I) + B(I)
ENDDO
!$ACC END REGION

END

Please help.

Hi Saumik,

An optimization called “dead code elimination” is removing C since it’s never used. Hence, there is nothing to accelerate. To fix, print a value of C after the loop.
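To illustrate, here is a minimal sketch of that fix, not the original program. It assumes N can be hard-coded instead of read from INPUT.DAT, the name DCE_SKETCH is made up, and USE ACCEL_LIB is left out because no runtime routines are called. The only change that matters is the PRINT after the region, which makes the values of C live.

```fortran
PROGRAM DCE_SKETCH
  ! Hypothetical reduced version of the program above.
  ! Assumption: N is fixed here rather than read from INPUT.DAT.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL :: A(N), B(N), C(N)
  INTEGER :: I

  DO I = 1, N
    A(I) = I
    B(I) = 2*I
  ENDDO

  !$ACC REGION COPYIN(A,B) COPYOUT(C)
  DO I = 1, N
    C(I) = A(I) + B(I)
  ENDDO
  !$ACC END REGION

  ! C is read after the region, so the loop result is no longer dead
  ! and the compiler has a reason to keep the loop.
  PRINT *, C(1), C(N)
END PROGRAM DCE_SKETCH
```

Without an accelerator-enabled compiler the !$ACC lines are ordinary comments, so the sketch still builds and runs on the host.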

Hope this helps,
Mat

Hi Mat,

I added a write statement after the loop, but got the following errors…

Error 1 unresolved external symbol _MAIN referenced in function MAIN tp.obj
Error 2 unresolved external symbol _cudaThreadSynchronize referenced in function ___pgi_cuda_launchk libacc1cu.lib
Error 3 unresolved external symbol _cudaEventCreate referenced in function ___pgi_cuda_launchk libacc1cu.lib
Error 4 unresolved external symbol _cudaEventRecord referenced in function ___pgi_cuda_launchk libacc1cu.lib
Error 5 unresolved external symbol _cudaEventSynchronize referenced in function ___pgi_cuda_launchk libacc1cu.lib
Error 6 unresolved external symbol _cudaEventElapsedTime referenced in function ___pgi_cuda_launchk libacc1cu.lib

Is there something I am missing?

Saumik.

Is there something I am missing?

When you installed the compilers, did you also install the CUDA Toolkit that is included in the PGI installation package? If you are not sure, look in the directory where your PGI compilers are installed (i.e., $PGI). The CUDA libraries should be in “$PGI/linux86-64/2011/cuda/”. If you do not have this directory, please reinstall the compilers and select ‘yes’ when asked if you wish to install the CUDA Toolkit.

If you do have the CUDA Toolkit installed, please post your complete compile and link commands and their output, with the “-v” (verbose) flag added.

Thanks,
Mat

Hi Mat,
I added a statement to write out the array C in output.dat and compiled the code (tp.f90) with the following command line option:

pgfortran tp.f90 -ta=nvidia -Minfo

and got the following output:

main:
     30, Generating copyin(b(:))
         Generating copyin(a(:))
         Generating copyout(c(:))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     32, Loop is parallelizable
         Accelerator kernel generated
         32, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 5 registers; 36 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 36 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 5 registers; 4 shared, 48 constant, 0 local memory bytes; 100% occupancy

The problem is that output.dat contains all zeros. What could be the reason?

Saumik

Hi Saumik,

What could be the reason?

I’m not sure. It works for me.

Do you have an example?

- Mat
% cat test.f90 
PROGRAM MAIN
USE ACCEL_LIB
DIMENSION A(100000000),B(100000000),C(100000000)

!PROGRAM TO ADD TWO MATRICES USING THE ACCELERATOR
OPEN(UNIT=100,FILE='INPUT.DAT',FORM='FORMATTED',ACCESS='SEQUENTIAL')
OPEN(UNIT=200,FILE='OUTPUT.DAT',FORM='FORMATTED',ACCESS='SEQUENTIAL')

READ(100,'(I12)') N

DO I = 1,N
  A(I) = I
  B(I) = 2*I
ENDDO

!$ACC REGION COPYIN(A,B) COPYOUT(C)
DO I = 1,N
  C(I) = A(I) + B(I)
ENDDO
!$ACC END REGION

DO I = 1,N
  write(200,*) I, '=', C(I)
ENDDO
END

% pgf90 test.f90 -ta=nvidia -Minfo=accel -V11.10
main:
     16, Generating copyin(b(:))
         Generating copyin(a(:))
         Generating copyout(c(:))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     17, Loop is parallelizable
         Accelerator kernel generated
         17, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 5 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 11 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
% cat INPUT.DAT 
100000000
% a.out
% tail OUTPUT.DAT 
     99999991 =   2.9999997E+08
     99999992 =   2.9999997E+08
     99999993 =   2.9999997E+08
     99999994 =   2.9999997E+08
     99999995 =   2.9999997E+08
     99999996 =   3.0000000E+08
     99999997 =   3.0000000E+08
     99999998 =   3.0000000E+08
     99999999 =   3.0000000E+08
    100000000 =   3.0000000E+08

Hi Mat,

Thanks for the reply. I had put the write statement before the !$acc end region directive instead of after it. Now I am getting the output. I observed that the compiler generates a ‘copyout(c(1:n))’ even if it is not mentioned explicitly as a clause in the !$acc region directive. Also, I face problems if I use a format specifier in the write statement i.e.

write(200,'(E12.4)') (C(I),I=1,N)

instead of

write(200,*) (C(I),I=1,N).

Saumik.

I observed that the compiler generates a ‘copyout(c(1:n))’ even if it is not mentioned explicitly

Correct. You only need to use the copy clauses when you want to override the compiler default or when the compiler cannot determine the bounds of an array.
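As a rough sketch of the two cases, the first region below has no data clauses and leaves the copies to the compiler, while the second spells out sub-array bounds. The program name, NMAX, and the choice of N = 100 are made up for illustration, and the sub-array clause form (for example C(1:N)) is an assumption based on the PGI Accelerator directive syntax, so check it against your compiler version.

```fortran
PROGRAM COPY_CLAUSE_SKETCH
  ! Hypothetical example; names and sizes are illustrative only.
  IMPLICIT NONE
  INTEGER, PARAMETER :: NMAX = 1000
  REAL :: A(NMAX), B(NMAX), C(NMAX)
  INTEGER :: I, N

  N = 100          ! assumption: only the first N elements are used
  C = 0.0
  DO I = 1, N
    A(I) = I
    B(I) = 2*I
  ENDDO

  ! Case 1: no data clauses, the compiler decides what to copy.
  !$ACC REGION
  DO I = 1, N
    C(I) = A(I) + B(I)
  ENDDO
  !$ACC END REGION
  PRINT *, C(N)

  ! Case 2: explicit sub-array clauses override the default and
  ! state the bounds directly.
  !$ACC REGION COPYIN(A(1:N),B(1:N)) COPYOUT(C(1:N))
  DO I = 1, N
    C(I) = A(I) - B(I)
  ENDDO
  !$ACC END REGION
  PRINT *, C(N)
END PROGRAM COPY_CLAUSE_SKETCH
```

Compiling with -Minfo=accel, as in the earlier posts, should report which copies the compiler generated for each region, which is a convenient way to compare the two cases.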

Also, I face problems if I use a format specifier in the write statement

What problems?

- Mat

Hi Mat,
I get zeros in output.dat if I use the format specifier.

Saumik.

Hi Saumik,

It works for me. Posting an example would help, though, since I can never be sure I’m doing exactly the same thing as you, as was the case with your previous error.

- Mat