Accelerator Fatal Error: No NVIDIA/CUDA version...

Hi, I’m using PGI Workstation version 14.4 to parallelize my Fortran test code.
However, a fatal error occurs when I run the executable: “Accelerator Fatal Error: No NVIDIA/CUDA version of this construct available for the current device”.
I am using PGI Workstation on Windows, and I am not sure whether I should switch from Windows to Linux and download the PGI Community Edition from this website. If not, is there another solution to this problem? Please help me solve it.

Here is my code:
program main
  implicit none
  integer::m,k,i,n=0
  real t1,t2
  call CPU_TIME(t1)

  !$acc kernels
  do m=1,10000000,2
    k=sqrt(real(m))
    do i=2,k
      if(mod(m,i)==0)exit
    end do
    if(i>k)then
      n=n+1
    end if
  end do
  !$acc end kernels

  call CPU_TIME(t2)

  print*,n
  print*,t2-t1
end program main

Hi hjd1234567,

Your code runs fine for me on my Windows system using 14.4, so I suspect something else is going on. What device are you using? (If you don’t know, please run the pgaccelinfo utility.) Also, can you please post the compilation line you’re using and the full text of the output?

-Mat


PGI$ pgf90 -Minfo test.f90  -fast -acc -Minfo -o mytest.exe -V14.4
main:
      7, Generating Tesla code
      8, Loop is parallelizable
         Accelerator kernel generated
          8, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         14, Sum reduction generated for n
     10, Inner sequential loop scheduled on accelerator
         Loop not vectorized/parallelized: potential early exits
     11, Accelerator restriction: induction variable live-out from loop: i
     12, Accelerator restriction: induction variable live-out from loop: i
PGI$ mytest.exe
       664579
    1.641000

Hi mkcolg,

The device name is GeForce GTX 1060. The CUDA version on my computer is 8.0.


E:\fortran\Console3\Console3>pgfortran console3.F90 -acc -Minfo
main:
7, Accelerator kernel generated
8, !$acc loop vector(256) ! threadidx%x
14, Sum reduction generated for n
7, Generating Tesla code
8, Loop is parallelizable
11, Accelerator restriction: induction variable live-out from loop: i
12, Accelerator restriction: induction variable live-out from loop: i

E:\fortran\Console3\Console3>console3
Accelerator Fatal Error: No NVIDIA/CUDA version of this construct available for the current device
File: E:\fortran\Console3\Console3\console3.F90
Function: main
Line: 7

PGI 14.4 is a few years old so doesn’t support the new Pascal architecture which your GTX 1060 uses. You’ll need to upgrade to a later PGI version and then compile with “-ta=tesla:cc60”.

PGI’s next release will be available soon. If you can, I’d suggest waiting for this release before upgrading.

-Mat

Thanks for your reply. Now I have two questions to ask.

First, I have to finish my work within a few days, so I would like to know whether the PGI Community Edition for Windows will be available soon.

Second, if I want to use OpenACC on Linux, which software do I need to install besides the PGI Community Edition for Linux and CUDA 8.0?

Thank you very much!

If I want to use OpenACC on the linux system, which software do I need to install except PGI community edition on the linux system and CUDA8.0?

To build OpenACC programs, all you need to download is the PGI compilers. We ship all needed components. However, to run your program, you’ll also need to install the CUDA driver.

-Mat

First, I have to finish my work within a few days, so I want to know if PGI community edition on the windows system will be available soon?

Now that 17.4 has been released, I can confirm that the PGI Community Edition is available for Windows. The caveat is that you will need to have Microsoft Visual C++ 2015 installed before installing the PGI compilers.

See http://www.pgroup.com/products/community.htm and the “Windows Co-install Requirements” page on the PGI site for details.

Mat

Thank you very much!

Now I am using the PGI compilers on Linux to speed up a complicated program. However, as soon as I use an OpenACC directive such as “!$acc kernels”, an error occurs: “call to … returned error 700: Illegal address during kernel execution.”

Do you know how to solve this problem? Please give me some advice.

Hi hjd1234567,

An illegal address error is a generic error that can have multiple causes. It’s basically a seg fault on the device. Here are some common causes:

  1. Out-of-bounds access of an array.
  2. Accessing a host address on the device.
  3. Stack or heap overflow on the device.
  4. Privatized array that grows too large.

Can you post or send to PGI Customer Service (trs@pgroup.com) a reproducing example? I can then take a look and try to determine the cause.

-Mat

type array
  real(8), pointer, dimension(:) :: r => null()
end type

integer i
type(array) :: a, b

allocate(a%r(100),b%r(100))

a%r = 1; b%r = 2

!$acc kernels
do i = 1,100
   a%r(i)=b%r(i)
enddo
!$acc end kernels

write(*,*) a%r(1:10)
end

compile command:
pgfortran -acc -ta=tesla:cc60 -Minfo test.f90

When I run this code, it gives me this error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Can you give me some advice on how to solve this problem? Thank you very much!

You’re accessing host pointers on the device, which is why you’re getting the illegal address error. Since you aren’t using data regions, the compiler implicitly copies a and b to the device. However, by default this is a shallow copy of a and b, so only the host address of r is copied over, not the data it points to. To fix this, you need to use data regions to explicitly create a and b, and then create each r array on the device. Something like the following:

% cat test.f90
type array
  real(8), pointer, dimension(:) :: r => null()
end type

integer i
type(array) :: a, b

allocate(a%r(100),b%r(100))

a%r = 1; b%r = 2

!$acc enter data create(a,b)
!$acc enter data create(a%r(100))
!$acc enter data copyin(b%r(100))

!$acc kernels loop independent present(a,b)
do i = 1,100
   a%r(i)=b%r(i)
enddo
!$acc end kernels

!$acc exit data copyout(a%r(100))
!$acc exit data delete(b%r)
!$acc exit data delete(a,b)

write(*,*) a%r(1:10)
end
% pgfortran test.f90 -Minfo=accel -acc -ta=tesla:cc60; a.out
MAIN:
     12, Generating enter data create(a,b)
     13, Generating enter data create(a%r(:100))
     14, Generating enter data copyin(b%r(:100))
     16, Generating present(b,a)
     17, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         17, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     22, Generating exit data copyout(a%r(:100))
     23, Generating exit data delete(b%r(:))
     24, Generating exit data delete(b,a)
    2.000000000000000         2.000000000000000         2.000000000000000
    2.000000000000000         2.000000000000000         2.000000000000000
    2.000000000000000         2.000000000000000         2.000000000000000
    2.000000000000000

Basically, any time you have an aggregate type with dynamic data members, you need to manually deep copy the data structure to/from the device. Order is important in that the type needs to be created before the data members. Also since the data clauses and update directive perform shallow copies, it’s important to only copy the “r” array. Copying a or b would overwrite the pointer’s address, not what it points to.

Note that I also added the “loop independent” clause to force the compiler to parallelize the loop. Since r is a pointer, the compiler must assume that a and b may point to the same r, which creates a loop dependency that prevents parallelization. If r were an allocatable, “independent” would not be needed.

Also, I added “present(a,b)” indicating to the compiler that a and b are being managed via data regions. Otherwise, it needs to create the implicit copy.
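
For reference, here is a minimal sketch of the allocatable variant mentioned above. It assumes the same test program with r declared allocatable instead of pointer, drops “independent”, and uses an update directive on a%r alone to bring the result back, so only the member's data is transferred and the parent type is left untouched. The program name and the explicit 1:100 bounds are illustrative choices, not part of the original code, and whether the loop is actually parallelized without “independent” should be confirmed from the -Minfo=accel output.

```fortran
program deep_copy_allocatable
  implicit none

  type array
    ! Assumption: allocatable member instead of the pointer used above
    real(8), allocatable, dimension(:) :: r
  end type

  integer :: i
  type(array) :: a, b

  allocate(a%r(100), b%r(100))
  a%r = 1; b%r = 2

  ! Create the parent objects first, then their dynamic members
  !$acc enter data create(a, b)
  !$acc enter data create(a%r(1:100))
  !$acc enter data copyin(b%r(1:100))

  ! a and b are already on the device, so no implicit copy is wanted
  !$acc kernels loop present(a, b)
  do i = 1, 100
    a%r(i) = b%r(i)
  end do

  ! Transfer only the member's data back to the host, never a itself
  !$acc update self(a%r(1:100))

  ! Delete members before the parents that hold them
  !$acc exit data delete(a%r, b%r)
  !$acc exit data delete(a, b)

  write(*,*) a%r(1:10)
end program deep_copy_allocatable
```

Built without -acc, the directives are treated as comments and the program still copies b%r into a%r on the host, which gives a quick sanity check of the logic before running on the device.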

Hope this helps,
Mat

Thank you! Your advice helps me a lot.

Now I have run into another error. Here is my code:

module ma

integer :: N=10

type array
  real(8), allocatable, dimension(:) :: r
end type

interface assignment(=)
   module procedure array_equal_array
end interface

interface operator(+)
   module procedure array_add_array
end interface

contains

!--------------------------------------------
   subroutine array_equal_array(a,b)
!-------------------------------------------- 

implicit none

type(array), intent(inout) :: a
type(array), intent(in) :: b
integer :: i

do i = 1, N
a%r(i)=b%r(i)
enddo

return
end subroutine



!--------------------------------------------
   function array_add_array(a,b) result(c)
!--------------------------------------------

implicit none

type(array), intent(in) :: a, b
type(array), allocatable :: c
integer :: i


allocate(c)
allocate(c%r(N))


!$acc enter data create(a,b,c)
!$acc enter data create(c%r(N))
!$acc enter data copyin(a%r(N),b%r(N))

!$acc kernels loop present(a,b,c)
do i = 1,N
   c%r(i)=a%r(i)+b%r(i)
enddo
!$acc end kernels

!$acc exit data copyout(c%r(N))
!$acc exit data delete(a%r(N),b%r(N))
!$acc exit data delete(a,b,c)
write(*,*) c%r(1:N)


end function
end module

use ma
type(array):: a,b,d

allocate(a%r(N),b%r(N),d%r(N))
a%r = 1; b%r = 2

d=array_add_array(a,b)
write(*,*) d%r

end

pgfortran -acc -ta=tesla:cc60 -Minfo test1.f90

array_equal_array:
29, Memory copy idiom, loop replaced by call to __c_mcopy8
array_add_array:
53, Generating enter data create(a,b,c)
54, Generating enter data create(c%r(:n))
55, Generating enter data copyin(a%r(:n),b%r(:n))
57, Generating present(a,b,c)
58, Loop carried dependence of c%r$p prevents parallelization
Complex loop carried dependence of a%r$p,b%r$p prevents parallelization
Loop carried backward dependence of c%r$p prevents vectorization
Accelerator kernel generated
58, !$acc loop seq
58, Accelerator scalar kernel generated
63, Generating exit data copyout(c%r(:n))
64, Generating exit data delete(a%r(:n),b%r(:n))
65, Generating exit data delete(a,b,c)


Then an error occurs:

./a.out

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Do you know how to solve this problem?

While you did forget to add the “independent” clause, so the loop isn’t getting parallelized, the illegal address error looks to be a compiler issue: the compiler is not properly handling the function result variable when that variable is a user-defined type (UDT) and allocated. I’ve added a problem report (TPR#24289) and sent it to engineering for further investigation.

The workaround would be to create a second temp variable to use in the compute region, and then copy the temp to the result variable after the end of the compute region.
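
A minimal sketch of that workaround, under a few assumptions: the local variable tmp and the module name ma_workaround are hypothetical, the function result is declared as a plain type(array) rather than allocatable, and the user-defined assignment interface is left out so the intrinsic assignment handles the final copy. This has not been checked against the affected compiler release, so treat it as an illustration of the idea rather than a confirmed fix.

```fortran
module ma_workaround
  implicit none
  integer :: N = 10

  type array
    real(8), allocatable, dimension(:) :: r
  end type

contains

  function array_add_array(a, b) result(c)
    type(array), intent(in) :: a, b
    type(array) :: c
    type(array) :: tmp   ! hypothetical temporary used inside the compute region
    integer :: i

    allocate(tmp%r(N))

    ! Parents first, then members; the result variable c never goes to the device
    !$acc enter data create(a, b, tmp)
    !$acc enter data create(tmp%r(1:N))
    !$acc enter data copyin(a%r(1:N), b%r(1:N))

    !$acc kernels loop independent present(a, b, tmp)
    do i = 1, N
      tmp%r(i) = a%r(i) + b%r(i)
    end do

    !$acc exit data copyout(tmp%r(1:N))
    !$acc exit data delete(a%r, b%r)
    !$acc exit data delete(a, b, tmp)

    ! Copy the temporary into the result on the host, after the compute region
    allocate(c%r(N))
    c%r = tmp%r
  end function array_add_array

end module ma_workaround

program test_workaround
  use ma_workaround
  implicit none
  type(array) :: a, b, d

  allocate(a%r(N), b%r(N), d%r(N))
  a%r = 1; b%r = 2

  d = array_add_array(a, b)
  write(*,*) d%r
end program test_workaround
```

If the workaround behaves as intended, every element of d%r should be 3.0, which is also what a host-only build without -acc prints.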

Thanks!
Mat