Dealing with allocatable arrays with OpenACC

I am new to GPU programming with OpenACC, and I have run into some problems while porting my program to a GPU version with OpenACC. The problems are listed below. Could someone help me
understand them?

  1. The first problem is about the default behavior of variables in a parallel region, i.e., whether variables inside a parallel region are thread-private or shared between threads according to the OpenACC standard. I did not find a precise definition of this in the tutorial materials I have read, but from some examples I guess that, without additional declarations, all arrays are shared between threads, all scalar variables are firstprivate, and all loop variables are private. Is my guess correct?

  2. The second question is about allocatable arrays in OpenACC; I will illustrate it with a simple program.

    program main
    implicit none
    !$acc routine(pnm_openacc) seq
    real(kind=8), allocatable :: psitam(:,:)
    integer :: nmax, i

    nmax = 20000

    !$acc parallel private(psitam)
    !$acc loop independent
    do i = 1, nmax
       allocate(psitam(100,i))
       psitam(1:100,1) = dble(i)
       call pnm_openacc(psitam,100,i)
       deallocate(psitam)
    end do ! loop i
    !$acc end parallel

    end ! the main program

subroutine pnm_openacc(p,north,i)
!$acc routine seq
implicit none
integer, intent(in)       :: north, i
real(kind=8)              :: p(north,i)
real(kind=8), allocatable :: p1(:)
real(kind=8)              :: AA(900,900)

allocate(p1(north))
p  = 1.D0
p1 = 1.D0
AA = 1.D0

end

In this program, if I compile it with the following command:
pgf90 -acc -ta=nvidia -Minfo=accel ex.f90 -o ex
then the program does not run correctly and the error message is:
"Failing in Thread:1

call to cuLaunchKernel returned error 1: Invalid value".

However, if I compile it with:
pgf90 -acc -gpu=cc60 -gpu=cuda11.0 -Minfo=accel ex.f90 -o ex
then the program runs without a runtime problem.

  3. The third problem is still about the test program above. When I compile it with the -gpu=cc60 switch, the program has no runtime problem; however, if I use the -gpu=cc70 switch, the program fails at runtime and the error message is
"Failing in Thread:1

call to cuLaunchKernel returned error 1: Invalid value".

This seems to suggest that I need a cc60 target machine; however, when I check the target machine with pgaccelinfo, the output includes (not all information shown):
Device Number: 0
Device Name: Tesla V100-PCIE-16GB
Device Revision Number: 7.0
Default Target: cc70
I am confused that the program does not work with the -gpu=cc70 switch on a V100 machine, which is a cc70 target, while cc60 works.

  4. The fourth question is about the -gpu=cuda11.0 switch. If I do not use this switch, the program does not compile; the error message is "pgf90-Error-CUDA version 10.1 is not available in this installation.". What is the reason behind this, and why do I need this switch?

  5. The fifth question is about the CUDA driver version. I illustrate it with a test program from this forum; I post it here and hope there is no legal problem.
module mod_data
integer, parameter :: dp = selected_real_kind(15, 307)
real(kind=dp), dimension(:,:), allocatable :: A, B
!$acc declare create(A,B)
end module mod_data

module use_data
use mod_data
contains
subroutine fillData (j,val,size)
!$acc routine vector
integer(8), value :: j, size
real(kind=dp), value :: val
integer(8) :: i
real(kind=dp) :: tmp(size)
!$acc loop vector
do i=1,size
tmp(i) = B(j,i) + ((j-1)*M)+(i-1)
end do

!$acc loop vector
do i=1,size
A(j,i) = tmp(i)
end do
end subroutine fillData
end module use_data

program example
use use_data
use cudafor

integer(8) :: N1, M1
integer(8) :: i,j,heapsize
integer    :: setheap,istat,iversion
istat=1
iversion=1
istat= cudaDriverGetVersion(iversion)
write(*,*) istat,iversion
istat= cudaRuntimeGetVersion(iversion)
write(*,*) istat,iversion 
N1=32
print *, "Input the number of elements to create: "
read(*,*), M1
print *, "Set the device heap? 1 - yes, 0 - no"
read(*,*), setheap
print *, "Heap size needed for automatic array: ", real(M1*N1*dp)/real(1024*1024), "(MB)"
heapsize = 2*(M1*N1*dp)
if (setheap > 0) then
  print *, "Setting heapsize=",real(heapsize)/real(1024*1024),"(MB)"
  istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize,heapsize);
  write(*,*) 'set OK'
endif
allocate(A(N1,M1),B(N1,M1))

!$acc kernels
B=2.5_dp
!$acc end kernels

!$acc parallel loop gang num_gangs(N1)
do j=1,N1
call fillData(j,2.5_dp,M1)
end do
!$acc update self(A)
print *, A(1,1), A(1,M1)
deallocate(A,B)

end program example

I still cannot run this program.
If I compile it with the command line
"pgfortran -Minfo=accel -acc -gpu=cc70 -gpu=cuda11.0 -Mcuda setHeap.f90 -o setHeap.out"
the program stops at the line
"allocate(A(N1,M1),B(N1,M1))"
and the error message is
"call to cudaGetSymbolAddress returned error 35: CUDA driver version is insufficient for CUDA runtime version".
It says the CUDA driver version is insufficient for the CUDA runtime version. However, you may notice that
I print both the driver version and the runtime version, and the output is:
0 9020 (driver version)
35 9020 (runtime version)
In the printout the driver version equals the runtime version. Where does this problem come from?
Meanwhile, on the command line I also used the -gpu=cuda11.0 switch, and I am using the NVIDIA HPC SDK version 2020/20.9. Why is the driver version 9020?

Moreover, I am currently working on a public Linux cluster, and I installed the NVIDIA HPC SDK in my local directory.

If I compile it with the command line:
"pgfortran -Minfo=accel -acc -gpu=cc60 -gpu=cuda11.0 -Mcuda setHeap.f90 -o setHeap.out"
the program stops at
"!$acc kernels
B=2.5_dp
!$acc end kernels"
with no other error message.

  6. The last question is about terminology. There seem to be many notions involved, such as the CUDA driver version, CUDA, the CUDA driver, the CUDA runtime, the HPC SDK, cc60, PGI, and so on. I am confused by these notions. Is there a reference covering this basic knowledge?

These are my questions. Many thanks for your help!

Hi wliang246,

Let’s see if I can help.

#1. Allocatable arrays are shared while scalars are firstprivate by default.
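For illustration, here's a quick sketch of those defaults (a made-up example, not from your code); no data or private clauses are needed for "s" or "i":

program defaults_sketch
implicit none
real(kind=8) :: a(1000)   ! array: shared among the threads by default
real(kind=8) :: s         ! scalar: firstprivate by default
integer :: i              ! loop variable: private
s = 2.0d0
a = 0.0d0
!$acc parallel loop
do i = 1, 1000
   a(i) = s + dble(i)     ! "a" is implicitly copied; each thread gets its own copy of "s" and "i"
end do
print *, a(1), a(1000)
end program defaults_sketch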

#2. There are several issues here.

!$acc parallel private(psitam)

By putting the private clause on the parallel region, the variable is shared at the gang level. Given there's only one loop, which will then have a "gang vector" schedule, the array will be shared by all vectors within a gang. This will cause race conditions as well as other problems. Instead, the private clause should be added to the "loop" directive.
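For reference, a sketch of just that directive change, reusing your variable names (the device-side allocate is still a problem; see the next point):

!$acc parallel
!$acc loop independent private(psitam)
do i=1,nmax
   allocate(psitam(100,i))      ! each loop iteration now works on its own private copy of psitam
   psitam(1:100,1)=dble(i)
   call pnm_openacc(psitam,100,i)
   deallocate(psitam)
end do
!$acc end parallel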

allocate(psitam(100,i))
allocate(p1(north))

While using allocate within a compute region is legal, it's highly discouraged. Besides slowing down performance (allocations are serialized), the device heap is very small, so the code will very likely overflow the heap, causing an illegal memory error. Better to allocate the array on the host and then copy it to the device.

real(kind=8) ::AA(900,900)

The stack is very small as well, so a very large static array like "AA" will cause problems too. Granted, in this case the subroutine will most likely be inlined, so it may not cause a stack overflow, but you should reconsider using a local array of this size.

Below is a small example. I changed the code so that "pnm_openacc" is in a module, and hence has an explicit interface, so I can pass in "p" as an assumed-shape array. Otherwise, you're using F77 calling conventions.

 % cat test.f90

module testme

contains

subroutine pnm_openacc(p,north,i)
!$acc routine  seq
implicit none
integer, value, intent(in)                :: north,i
real(kind=8)                      ::p(:,:)
p(i,1:north)=real(i)
end subroutine pnm_openacc

end module testme

program main
use testme
implicit none
real(kind=8),allocatable ::psitam(:,:)
integer ::nmax,i

nmax=20000
allocate(psitam(nmax,100))

!$acc parallel copyout(psitam)
!$acc loop
do i=1,nmax
call pnm_openacc(psitam,100,i)
end do ! loop i
!$acc end parallel

print *, psitam(3,1:2)
print *, psitam(10,1:2)
print *, psitam(nmax,1:2)
deallocate(psitam)

end ! the main program

% nvfortran -acc test.f90; a.out
    3.000000000000000         3.000000000000000
    10.00000000000000         10.00000000000000
    20000.00000000000         20000.00000000000

#3. The “cc70” version is failing due to the issues I describe above. Since a “cc60”-built binary can't be run on a V100, most likely your program is running on the host. If a binary can't be run on the device that is present, it will fall back to running on the host.

#4, #5. These are the same issue.

Your CUDA Driver is most likely version 10.1. However, you downloaded the NVHPC SDK package that contains support only for CUDA 11. There are two packages due to size: one with the current and two past CUDA releases, which is rather large, and a second with just the most current release.

By default, the compiler will target the CUDA version of the CUDA Driver. Since you don't have CUDA 10.1 installed, you get the error in #4. When you build using CUDA 11, the 10.1 CUDA Driver is too old to run the binary.

You have two choices. Either update your CUDA Driver to 11, or download the full NVHPC SDK containing CUDA 10.1.

#6.

I don’t think there’s one spot for all of these items but you can start with the HPC SDK Release Notes and the Compiler Documentation.

Hope this helps,
Mat

Hi, Mat,

First of all, many thanks for your help with this problem. It has been very helpful for understanding the issues and correcting my misconceptions about OpenACC programming. I am also lucky to learn about these issues at a very early stage of using this programming standard. However, I still have some puzzles, and I would like to ask further.

#1. Yes, and are non-allocatable arrays also shared by default?

#2.

  1. !$acc parallel private(psitam)
    Yes, I will define private arrays using the private clause on the “loop” directive in the future. Moreover, I would like to ask: in this case there is only one loop and it will have a “gang vector” schedule. If n_loop is the number of loop iterations, will this region run with n_loop gangs of 1 vector each, or 1 gang with n_loop vectors, or some other mode, such as 2 gangs with n_loop/2 vectors each? I think if it is n_loop gangs, then there would be no race conditions among vectors. Is that true?

  2. allocate(psitam(100,i))
    As you said, the heap size is very small, so an illegal memory error may happen. At runtime, which memory level is the heap located in, global memory or shared memory on the GPU? I know we can reset the heap size of the GPU; could this additional setting help solve the problem? A Tesla V100 has nearly 16 GB of memory, but this is not enough for such a small program? Why does such a heap size problem not happen on the CPU; is it due to the small size of the memory? What is the common heap size on a GPU?

  3. real(kind=8) ::AA(900,900)
    For this issue, I would ask how to judge whether a subroutine is inlined or not. And why would inlined subroutines not cause a stack overflow? Is this due to the fact that the calling subroutine and the main program have their stacks in different regions? Moreover, I use big arrays in subroutines very often; how should I deal with this issue?

  4. Thanks for your example program; it works now. But I would ask what the main difference is between the F77 calling conventions and the ones used now. What is the function of an interface? I have always used the F77 conventions in my own programs.

#3. I can now successfully compile and run the test program with “cc70”. I would ask why the program is much slower when it falls back to running on the host than a counterpart version compiled only for the host. Moreover, until now I thought the parallel region was compiled only for the device and thus could only run on the device. How can this region run on the host? Is a host version compiled at the same time?

#4, #5, #6. OK, I will install the new driver and read the reference materials.

Best regards,
wliang246

Yes, I should have been more general. Arrays in general are shared by default.

#2.1

I'm not 100% clear on the question, but if you have something like:

!$acc parallel private(array)
!$acc loop gang
do i = 1, n
   ! array is private to each gang
   !$acc loop vector
   do j = 1, arraySize
      ! array is shared by the vectors within the gang
      array(j) = ...
   end do
end do
!$acc end parallel

#2.2

The amount of global memory on the device is irrelevant as the kernel's heap space is only around 32MB total for all threads. You can increase this a bit, but even if you can get the allocation to fit within the heap, allocations hurt performance and the allocation lifetime is only for the duration of the kernel. Best to simply avoid device-side allocation and perform allocations from the host.
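If you do want to experiment with a larger heap, it can be set from the host before any kernels launch, the same way the setHeap program above does with cudaDeviceSetLimit. A minimal sketch (the 256 MB value is just an example):

program raise_heap_sketch
use cudafor                ! CUDA Fortran module; compile with -Mcuda as in your command line
implicit none
integer :: istat
integer(kind=8) :: heapsize
heapsize = 256_8 * 1024 * 1024          ! example size; the default heap is only ~32MB
istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize)
if (istat /= 0) print *, 'cudaDeviceSetLimit failed, istat=', istat
end program raise_heap_sketch

But again, host-side allocation is the better approach.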

#2.3.a "I would ask how to judge whether a subroutine is inlined or not."

There are several ways in which inlining occurs. The case I'm suggesting is that when the device routine is located in the same file as the calling kernel, the back-end CUDA device compiler will often inline these routines. There's no way that I'm aware of to know for sure.

You can have the front-end Fortran compiler inline via the “-Minline” flag. This occurs prior to the device code generation and, if successful, the compiler will emit a feedback message when the “-Minfo=inline” flag is used. For cross-file inlining where each object is compiled separately, you first need to perform an extract pass and store the inline information in a library, i.e. compile with “-Mextract=lib:libname”, followed by a second compilation with “-Minline=lib:libname”. Multiple files compiled together, i.e. “nvfortran -Minline file1.f90 file2.f90”, do not need the extract pass.
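For example, the cross-file flow would look roughly like this (file and library names here are made up for illustration, and the exact invocation may vary):

% nvfortran -Mextract=lib:mylib.il routines.f90
% nvfortran -acc -c -Minline=lib:mylib.il -Minfo=inline main.f90

while files compiled together only need the single "-Minline" compile shown above.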

#2.3.b "And why would inlined subroutines not cause a stack overflow? Is this due to the fact that the calling subroutine and the main program have their stacks in different regions?"

The stack is used when calling subroutines. If the call is inlined, then there's no call, and hence no stack usage. Each kernel (OpenACC compute construct) will have its own stack. The stack, like the heap, is small at about 32MB.

#2.3.c "Moreover, I use big arrays in subroutines very often; how should I deal with this issue?"

Best to pass in the array as an argument to the subroutine, where the array is either private to one of the compute region's loops or manually privatized (i.e., a shared array with an extra dimension sized to the loop bounds).
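A small sketch of the manual-privatization idea (made-up names and sizes), where the extra dimension is indexed by the loop variable:

program manual_private_sketch
implicit none
integer :: i, north, nmax
real(kind=8), allocatable :: work(:,:)
north = 100
nmax  = 20000
allocate(work(north, nmax))         ! allocated once, on the host
!$acc parallel loop copyout(work)
do i = 1, nmax
   work(1:north, i) = dble(i)       ! column i serves as this iteration's "private" work space
end do
print *, work(1,1), work(1,nmax)
end program manual_private_sketch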

Note that depending on the algorithm, processing large arrays within a sequential device subroutine may not be performant due to SIMT and memory divergence. CUDA threads (OpenACC vectors) are grouped into a “warp” in which SIMT (Single Instruction, Multiple Threads) execution takes place. Basically, all threads within a warp execute the same instruction at the same time, just on different data. Hence, when one thread accesses memory, all threads access memory. Since the memory of a local private array is not contiguous across the threads in a warp, the kernel will get memory divergence, as all threads within the warp must wait for all other threads to complete the memory access. If the memory is contiguous, then all the memory can be brought into cache as one larger chunk.

Without seeing your code it's difficult to give specific advice, but you may consider making your subroutines “vector”, assuming you have parallel vector loops within the subroutine, so that the array's memory can be contiguous across threads in a warp.
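As a rough sketch of such a “vector” routine (made-up names; the idea is the same as the fillData example earlier in this thread):

module vec_routine_sketch
contains
subroutine scale_col(p, n, fac)
!$acc routine vector
implicit none
integer, value :: n
real(kind=8), value :: fac
real(kind=8) :: p(n)
integer :: j
!$acc loop vector
do j = 1, n
   p(j) = p(j) * fac          ! consecutive vector lanes touch consecutive elements
end do
end subroutine scale_col
end module vec_routine_sketch

It would then be called from a gang loop, e.g. "!$acc parallel loop gang" over the columns.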

#2.4 "But I would ask what the main difference is between the F77 calling conventions and the ones used now. What is the function of an interface? I have always used the F77 conventions in my own programs."

It's not that you can't use F77 interfaces, but you should really consider moving to Fortran 90, especially for new code. It's too big a topic to cover here, but mostly you gain the ability to use modules, interfaces, and to pass F90 allocatable arrays to subroutines using assumed shape.

#3 "How can this region run on the host? Is a host version compiled at the same time?"

The compiler will create both a sequential host and a device version of the code, falling back to the host version if it's unable to run on the device. You can have it fall back to running in parallel on the host if you use the flags “-gpu=tesla,multicore”, where “multicore” indicates running the OpenACC compute regions across the multicore CPU.

Hi, Mat,

Thanks again for your answers; I understand all the issues now.

My ultimate goal is to use multiple GPUs in the cluster and accelerate the program on a large scale. However, the 'MPI_SEND' subroutine call does not compile, and I show the example code and the error message below. Could you help me find the issues in this program?

example.f90
program main
use mpi
use omp_lib
use openacc

implicit none

integer(kind=4)::num_coef,nmax,n_north,n_east,i,j,k,m,num_grid,prd
real(kind=8),allocatable::grav(:,:),asitam(:,:),bsitam(:,:)
real(kind=8):: error
integer(kind=4)::count,ierr,myid,numprocs,rc,request,status(MPI_STATUS_SIZE)
character(len=40) filedgcombine
complex*16,allocatable:: fftdata_cpx(:)
integer:: num_device,idx_device

call MPI_INIT_thread(MPI_THREAD_MULTIPLE,prd,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,numprocs,ierr)

num_device= acc_get_num_devices(acc_device_nvidia)

call MPI_BARRIER(MPI_COMM_WORLD,ierr)
if(myid==0) then
	write(*,*) "please input n_east,n_north,num_grid and nmax"
	read(*,*) n_east,n_north,num_grid,nmax
end if
call MPI_BCAST(n_east,1,MPI_INTEGER4,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(n_north,1,MPI_INTEGER4,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(num_grid,1,MPI_INTEGER4,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(nmax,1,MPI_INTEGER4,0,MPI_COMM_WORLD,ierr)

allocate(grav(num_grid,4))
allocate(fftdata_cpx(n_east))
if(myid==0) then
	allocate(asitam(n_north,nmax))
	allocate(bsitam(n_north,nmax))
end if
	if(myid==0) then
    	write(*,*) "please input the gravity anomaly file name and error "
    	read(*,*) filedgcombine,error
    	open(unit=10,file=filedgcombine)
    	do i=1,num_grid
      		read(10,*) grav(i,1),grav(i,2),grav(i,3),grav(i,4)
    	end do
    	close(10)     
	end if
	call MPI_BARRIER(MPI_COMM_WORLD,ierr)
	call MPI_BCAST(grav,num_grid*4,MPI_REAL8,0,MPI_COMM_WORLD,ierr)

call acc_init(acc_device_nvidia)

If(myid==0) then

    asitam=0.D0
    bsitam=0.D0

!$OMP PARALLEL DEFAULT(PRIVATE),SHARED(nmax,n_north,n_east,asitam,bsitam),num_threads(32)
!$OMP DO schedule(dynamic)
do j=1,n_north

    call MPI_RECV(fftdata_cpx,n_east,MPI_COMPLEX,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,STATUS,ierr)

    count=status(MPI_TAG)
    asitam(count,1)=dble(fftdata_cpx(1))
    bsitam(count,1)=0.D0
    asitam(count,2:nmax)= 2.D0*dble(fftdata_cpx(2:nmax))
    bsitam(count,2:nmax)=-2.D0*aimag(fftdata_cpx(2:nmax))

end do  !  j loop

!$OMP END DO
!$OMP END PARALLEL

else

idx_device=mod(myid,num_device)
call acc_set_device_num(idx_device,acc_device_nvidia)
!$acc kernels
!$acc loop independent private(fftdata_cpx)
do i=myid, n_north, numprocs-1
	fftdata_cpx=dcmplx(0.D0,0.D0)
	!$acc loop independent
	do j=1, n_east
	        fftdata_cpx(j)=dcmplx(grav((i-1)*n_east+j,4))
	end do
	fftdata_cpx=dconjg(fftdata_cpx)/dble(n_east)
	call MPI_Send(fftdata_cpx,n_east,MPI_COMPLEX,0,i,MPI_COMM_WORLD,ierr)
end do   ! loop i
!$acc end kernels

end if

if(myid/=0) then
     allocate(asitam(n_north,nmax))
     allocate(bsitam(n_north,nmax))
end if
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
call MPI_BCAST(asitam,n_north*nmax,MPI_REAL8,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(bsitam,n_north*nmax,MPI_REAL8,0,MPI_COMM_WORLD,ierr)
call MPI_BARRIER(MPI_COMM_WORLD,ierr)


call MPI_FINALIZE(rc)

end ! the main program

the compile command is:
mpif90 -acc -mp -gpu=cc70 -gpu=cuda11.0 -Minfo=accel example.f90 -o example

and the error message is:

NVFORTRAN-S-0155-Procedures called in a compute region must have acc routine information: mpi_send (ggg.f90: 85)
NVFORTRAN-S-0155-Accelerator region ignored; see -Minfo messages (example.f90: 76)
0 inform, 0 warnings, 2 severes, 0 fatal for main
main:
76, Accelerator region ignored
85, Accelerator restriction: call to ‘mpi_send’ with no acc routine information

Hi wliang246,

The MPI API calls such as MPI_SEND are host routines and can't be called from within device code, so what you have here won't work. If you have a CUDA-aware MPI implementation, such as the OpenMPI we ship with the compilers, then you can pass in device variables (by using the OpenACC “host_data” construct).

-Mat

Hi, Mat

Many thanks for your reply.

So you mean I can send device variables with MPI_SEND inside the OpenACC “host_data” construct if a CUDA-aware OpenMPI is installed. Do you have a reference or example code showing how to implement this?

Many thanks!

If you do a web search for “MPI OpenACC” you’ll find several references. Granted, a lot are in C and several years old, but should give you some background.

Also, you may want to take class #2: https://developer.nvidia.com/openacc-advanced-course

Basically, you’d have something like:

!$ACC HOST_DATA use_device(left_snd_buffer,left_rcv_buffer)
    CALL MPI_ISEND(left_snd_buffer,total_size,MPI_DOUBLE_PRECISION,left_task,tag_send &
      ,MPI_COMM_WORLD,req_send,err)

    CALL MPI_IRECV(left_rcv_buffer,total_size,MPI_DOUBLE_PRECISION,left_task,tag_recv &
      ,MPI_COMM_WORLD,req_recv,err)
!$ACC END HOST_DATA

-Mat

Hi, Mat

Many thanks for your reply!

I will take the course first and then try to use the OpenACC HOST_DATA construct afterwards.