I am new to GPU programming with OpenACC, and I have run into several problems while porting my program to a GPU version. The problems are listed below. Could someone help me understand them?
-
The first problem is about the default data attributes of variables in a parallel region, i.e., whether variables inside a parallel region are private to each thread or shared between threads according to the OpenACC standard. I could not find a precise statement about this in the tutorial materials I have read, but from some examples I guess that, without an explicit data clause, all arrays are shared among the threads, all scalar variables are firstprivate, and all loop variables are private. Is my guess correct?
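For comparison, this is the kind of explicit data-clause usage I have been writing so far to avoid relying on the defaults (only a minimal sketch with made-up variable names, not taken from my real code):

program defaults_example
  implicit none
  integer :: i, n
  real(kind=8) :: s
  real(kind=8) :: a(1000)
  n = 1000
  s = 2.D0
  ! Explicit clauses instead of relying on the implicit rules:
  ! the array is copied to and from the device, the scalar is
  ! firstprivate, and the loop index is private.
  !$acc parallel loop copy(a) firstprivate(s) private(i)
  do i = 1, n
     a(i) = s*dble(i)
  end do
  print *, a(1), a(n)
end program defaults_example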
-
The second question is about allocatable arrays in OpenACC, which I will illustrate with a simple program.
program main
  implicit none
  !$acc routine(pnm_openacc) seq
  real(kind=8), allocatable :: psitam(:,:)
  integer :: i, nmax
  nmax = 20000
  !$acc parallel private(psitam)
  !$acc loop independent
  do i = 1, nmax
     allocate(psitam(100,i))
     psitam(1:100,1) = dble(i)
     call pnm_openacc(psitam, 100, i)
     deallocate(psitam)
  end do ! loop i
  !$acc end parallel
end ! the main program

subroutine pnm_openacc(p, north, i)
  !$acc routine seq
  implicit none
  integer, intent(in) :: north, i
  real(kind=8) :: p(north,i)
  real(kind=8), allocatable :: p1(:)
  real(kind=8) :: AA(900,900)
  allocate(p1(north))
  p = 1.D0
  p1 = 1.D0
  AA = 1.D0
end
In this program, if I compile it with the following command:
pgf90 -acc -ta=nvidia -Minfo=accel ex.f90 -o ex
then the program fails at run time and the error message is:
"Failing in Thread:1
call to cuLaunchKernel returned error 1: Invalid value".
However, if I compile it with:
pgf90 -acc -gpu=cc60 -gpu=cuda11.0 -Minfo=accel ex.f90 -o ex
then the program runs without this runtime error.
-
The third problem is still with the same test program. When I compile it with the switch -gpu=cc60, the program has no runtime problem; however, with -gpu=cc70 it fails at run time and the error message is
" Failing in Thread:1
call to cuLaunchKernel returned error 1: Invalid value".
It seems that I should be targeting a cc60 machine; however, when I check the device with pgaccelinfo, the output (abbreviated) is:
Device Number: 0
Device Name: Tesla V100-PCIE-16GB
Device Revision Number: 7.0
Default Target: cc70
I am confused that the program does not work correctly with -gpu=cc70 on a V100 machine, whose default target is cc70, while -gpu=cc60 works.
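Related to the second and third questions, this is the restructuring I considered as a workaround, hoisting all allocation out of the compute region so that no allocate/deallocate is executed in device code (just a sketch of what I guess the recommended pattern is, assuming the per-iteration array can be given a fixed maximum size):

program main_hoisted
  implicit none
  integer :: i, nmax
  real(kind=8) :: psitam(100)   ! fixed-size work array, one private copy per iteration
  nmax = 20000
  !$acc parallel loop private(psitam)
  do i = 1, nmax
     psitam(:) = dble(i)        ! no allocate/deallocate inside the compute region
     ! ... further work on psitam ...
  end do
end program main_hoisted

With this version nothing would need to be allocated on the device at all.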
-
The fourth question is about the -gpu=cuda11.0 switch. If I do not use this switch, the program does not compile, and the error message is “pgf90-Error-CUDA version 10.1 is not available in this installation.”. What is the reason behind this, and why do I need this switch?
-
The fifth question is about the CUDA driver version. I illustrate it with a test program from this forum; I repost it here and hope that is not a problem.
module mod_data
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(kind=dp), dimension(:,:), allocatable :: A, B
  !$acc declare create(A,B)
end module mod_data

module use_data
  use mod_data
contains
  subroutine fillData(j, val, size)
    !$acc routine vector
    integer(8), value :: j, size
    real(kind=dp), value :: val
    integer(8) :: i
    real(kind=dp) :: tmp(size)
    !$acc loop vector
    do i = 1, size
       tmp(i) = B(j,i) + ((j-1)*size) + (i-1)
    end do
    !$acc loop vector
    do i = 1, size
       A(j,i) = tmp(i)
    end do
  end subroutine fillData
end module use_data

program example
  use use_data
  use cudafor
  integer(8) :: N1, M1
  integer(8) :: i, j, heapsize
  integer :: setheap, istat, iversion
  istat = 1
  iversion = 1
  istat = cudaDriverGetVersion(iversion)
  write(*,*) istat, iversion
  istat = cudaRuntimeGetVersion(iversion)
  write(*,*) istat, iversion
  N1 = 32
  print *, "Input the number of elements to create: "
  read(*,*) M1
  print *, "Set the device heap? 1 - yes, 0 - no"
  read(*,*) setheap
  print *, "Heap size needed for automatic array: ", real(M1*N1*dp)/real(1024*1024), "(MB)"
  heapsize = 2*(M1*N1*dp)
  if (setheap > 0) then
     print *, "Setting heapsize=", real(heapsize)/real(1024*1024), "(MB)"
     istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize)
     write(*,*) 'set OK'
  endif
  allocate(A(N1,M1), B(N1,M1))
  !$acc kernels
  B = 2.5_dp
  !$acc end kernels
  !$acc parallel loop gang num_gangs(N1)
  do j = 1, N1
     call fillData(j, 2.5_dp, M1)
  end do
  !$acc update self(A)
  print *, A(1,1), A(1,M1)
  deallocate(A, B)
end program example
I still cannot run this program.
If I compile it with the command line
“pgfortran -Minfo=accel -acc -gpu=cc70 -gpu=cuda11.0 -Mcuda setHeap.f90 -o setHeap.out”
the program stops at the line
“allocate(A(N1,M1),B(N1,M1))”
and the error message is
“call to cudaGetSymbolAddress returned error 35: CUDA driver version is insufficient for CUDA runtime version”.
It says the CUDA driver version is insufficient for the CUDA runtime version. However, as you can see,
I print both the driver version and the runtime version, and the output is:
0 9020 (driver version)
35 9020 (runtime version)
In this printout the driver version equals the runtime version. Where does this problem come from?
Also, I used the switch -gpu=cuda11.0 on the command line, and I am using NVIDIA HPC SDK 2020/20.9. Why is the reported driver version 9020?
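To isolate the version question, I also tried a minimal standalone check (just a sketch, assuming the cudafor interfaces cudaDriverGetVersion and cudaRuntimeGetVersion work as documented; I only trust the printed value when the returned status is 0):

program version_check
  use cudafor
  implicit none
  integer :: istat, iversion
  iversion = -1
  istat = cudaDriverGetVersion(iversion)
  if (istat == 0) then
     print *, "driver version:  ", iversion
  else
     print *, "cudaDriverGetVersion failed, istat =", istat
  end if
  iversion = -1
  istat = cudaRuntimeGetVersion(iversion)
  if (istat == 0) then
     print *, "runtime version: ", iversion
  else
     print *, "cudaRuntimeGetVersion failed, istat =", istat
  end if
end program version_check

My thinking is that, since the runtime call in the original printout returned status 35, the 9020 shown for the runtime version may just be the value left over from the previous call.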
In addition, I am working on a public Linux cluster, and I have installed the NVIDIA HPC SDK in my local directory.
If I compile it with the command line:
“pgfortran -Minfo=accel -acc -gpu=cc60 -gpu=cuda11.0 -Mcuda setHeap.f90 -o setHeap.out”
the program stops at
“!$acc kernels
B=2.5_dp
!$acc end kernels”.
and there is no other error message.
-
The last question is about terminology. There seem to be many related terms here, such as CUDA driver version, CUDA, CUDA driver, CUDA runtime, HPC SDK, cc60, PGI, and so on, and I am confused by them. Is there a reference that explains these basics?
These are my questions. Many thanks for your help!