Pgfortran 20.4 and OpenACC giving "cudaLaunchKernel returned status 2: out of memory"

The following snippet produces the error message in the title. The code runs successfully on the CPU but not on the GPU. Searching for the message turned up almost nothing. Does anyone know the solution?

4942 !$acc parallel loop &
4943 !$acc private(ss_l,dd_l,r_bf_l,q_bf_l,mu79_l,nu79_l, &
4944 !$acc   r_79,r2_79,r3_79,ri_79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ri8_79,&
4945 !$acc   b2i79,ps179,bz179,ph179,vz179,ch179,pst79,bzt79,pt79,rho79,&
4946 !$acc   vis79,vic79,vip79,for79,sir79,bfp079,bfp179,ps079,bz079,ph079,vz079,&
4947 !$acc ch079,pss79,bzs79,bzx79,psx79,bfpx79,pstx79,bztx79,&
4948 !$acc weight_79,norm79)
4949 do j=1,dofs_per_element
4950 ss_l=0.
4951 dd_l=0.
4952 r_bf_l=0.
4953 q_bf_l=0.
4954 mu79_l=mu79
4955 nu79_l(:,:)=nu79(j,:,:)
4956 call compression_lin(mu79_l,nu79_l, &
4957 ss_l,dd_l,r_bf_l,q_bf_l,advfield,izone)
4960 ss(:,j,:)=ss_l(:,:)
4961 dd(:,j,:)=dd_l(:,:)
4962 r_bf(:,j)=r_bf_l(:)
4963 q_bf(:,j)=q_bf_l(:)
4964 end do

Hi jdh4,

The error means that the device ran out of memory. You do have quite a few private variables, at least some of which appear to be arrays. Keep in mind that every gang or vector gets its own private copy of the arrays. If the arrays are large, this can use up a lot of memory.

What are the sizes of your arrays?
Can you post the compiler feedback messages (i.e. add -Minfo=accel to the compile) so I can see how the compiler is scheduling this loop?

If it's using a "gang vector" schedule for the outer loop, then you can try using "!$acc parallel loop gang" on the outer loop. This way only each gang gets a private copy of the arrays, rather than every vector. (Hopefully the compiler will then auto-parallelize the array syntax, but you may need to make "compression_lin" a vector routine, assuming it contains parallelizable loops, so you don't lose performance.)

If it's still too big, you can then fix the number of gangs via the "num_gangs(N)" clause, where "N" is the maximum number of gangs you can use while the private arrays still fit into memory.
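As a rough sketch of what that could look like (directive placement only; the loop body and most of the private list are elided, and num_gangs(32) is just a placeholder value to tune):

subroutine compression_lin(mu79_l,nu79_l,ss_l,dd_l,r_bf_l,q_bf_l,advfield,izone)
!$acc routine vector            ! let the routine's inner loops use the vector lanes
...
end subroutine compression_lin

!$acc parallel loop gang num_gangs(32) &               ! gang-only schedule on the outer loop
!$acc private(ss_l,dd_l,r_bf_l,q_bf_l,mu79_l,nu79_l)   ! ...plus the rest of your private list
do j=1,dofs_per_element
   ...
end do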

Also, there seem to be several variables in the private list that aren't used in the loop. If they aren't used, they should be removed.

Another possibility is to change the algorithm to use the 3D arrays directly rather than using the scratch arrays. For example, get rid of "ss_l" and use "ss" directly.
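For instance, something along these lines (just a sketch; "compression_lin_j" is a hypothetical variant of your routine that takes the loop index and writes straight into the full arrays):

!$acc parallel loop gang
do j=1,dofs_per_element
   ! no gang-private scratch copies; the full arrays live once in device memory
   call compression_lin_j(j, mu79, nu79, ss, dd, r_bf, q_bf, advfield, izone)
end do
! ...where compression_lin_j assigns to ss(:,j,:), dd(:,j,:), r_bf(:,j), q_bf(:,j) directly.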

-Mat


Hi Mat,

Thanks for your help. Here are our answers to your first two questions:

  1. The sizes of the arrays:
    Most of them are 2D arrays:
    MAX_PTS=60
    OP_NUM= 26
    dofs_per_element=72
    num_fields=3

scalars: surface_int, npoints, bdf
real, dimension(MAX_PTS) :: r_79, r2_79, r3_79, ri_79, ri2_79, ri3_79, ri4_79, ri5_79, ri6_79, ri7_79, ri8_79
real, dimension(MAX_PTS,2) :: norm79
real, dimension(MAX_PTS, OP_NUM) :: tm79, ni79, b2i79, bi79
real, dimension(MAX_PTS, OP_NUM) :: ps179, bz179, pe179, n179, ph179, vz179, ch179, p179, ne179, pi179
real, dimension(MAX_PTS, OP_NUM) :: pst79, bzt79, pet79, nt79, pht79, vzt79, cht79, pt79, net79
real, dimension(MAX_PTS, OP_NUM) :: rho79, nw79
real, dimension(MAX_PTS, OP_NUM) :: vis79, vic79, vip79, for79, es179
real, dimension(MAX_PTS, OP_NUM) :: jt79, cot79, vot79, pit79, eta79, etaRZ79,sig79, fy79, q79, cd79, totrad79, linerad79, bremrad79, ionrad79, reckrad79, recprad79, sie79, sii79, sir79
real, dimension(MAX_PTS, OP_NUM) :: bfp079, bfp179, bfpt79
real, dimension(MAX_PTS, OP_NUM) :: ps079, bz079, pe079, n079, ph079, vz079, ch079, p079, ne079, pi079
real, dimension(MAX_PTS, OP_NUM) :: pss79, bzs79
real, dimension(MAX_PTS, OP_NUM) :: bzx79, psx79, bfpx79, bfx79, psc79
real, dimension(MAX_PTS, OP_NUM) :: pstx79, bztx79, bfptx79, bftx79
real, dimension(MAX_PTS) :: xi_79, zi_79, eta_79, weight_79
real, dimension(dofs_per_element,num_fields) :: ss_l, dd_l
real, dimension(dofs_per_element) :: r_bf_l, q_bf_l
real, dimension(dofs_per_element, MAX_PTS, OP_NUM) :: mu79_l
real, dimension(MAX_PTS, OP_NUM) :: nu79_l

  2. Output from make:
    mpifort -c -r8 -Mpreprocess … -DUSEBLAS -DPETSC_VERSION=990 -DUSEBLAS -fast -Minfo=accel -Mcuda -acc -ta=tesla … ludef_t.f90 -o ludef_t.o
    compression_lin:
    1111, Generating acc routine seq
    Generating Tesla code
    ludefvel_n:
    4855, Generating create(r_bf_l(:),q_bf_l(:),dd_l(:,:),ss_l(:,:)) [if not already present]
    4867, Generating create(nu79_l(:,:),mu79_l(:,:,:)) [if not already present]
    4941, Generating update device(b2i79(:,:),norm79(:,:),for79(:,:),vz179(:,:),weight_79(:),sir79(:,:),bzt79(:,:),pst79(:,:),bfp179(:,:),pstx79(:,:),bztx79(:,:),bz179(:,:),ps179(:,:),bzx79(:,:),vis79(:,:),npoints,bfpx79(:,:),ph179(:,:),bfp079(:,:),bdf,ri_79(:),ch179(:,:),vic79(:,:),surface_int,vip79(:,:),r_79(:),psx79(:,:),bzs79(:,:),pss79(:,:),ch079(:,:),vz079(:,:),ph079(:,:),bz079(:,:),ps079(:,:),ri7_79(:),ri6_79(:),ri5_79(:),ri4_79(:),ri3_79(:),ri2_79(:),rho79(:,:),ri8_79(:),r2_79(:),pt79(:,:),r3_79(:))
    4942, Generating Tesla code
    4949, !$acc loop seq
    4950, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4952, !$acc loop vector(128) ! threadidx%x
    4954, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4955, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4960, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4962, !$acc loop vector(128) ! threadidx%x
    4942, CUDA shared memory used for sir79,ch179,r_79,pt79,for79,norm79,r3_79,q_bf_l,r2_79,ri8_79,rho79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ps079,bz079,ph079,vz079,ch079,pss79,bzs79,psx79,r_bf_l,vip79,ss_l,vic79,dd_l,ri_79,b2i79,bfp079,ph179,bfpx79,nu79_l,vis79,bzx79,ps179,bz179,bztx79,pstx79,bfp179,pst79,bzt79,weight_79,vz179,mu79_l
    Generating implicit copyout(dd(:,:,:)) [if not already present]
    Generating implicit copyin(mu79(:,:,:)) [if not already present]
    Generating implicit copyout(ss(:,:,:),r_bf(:,:)) [if not already present]
    Generating implicit copyin(nu79(:,:,:)) [if not already present]
    Generating implicit copyout(q_bf(:,:)) [if not already present]
    4950, Loop is parallelizable
    4952, Loop is parallelizable
    4954, Loop is parallelizable
    4955, Loop is parallelizable
    4960, Loop is parallelizable
    4962, Loop is parallelizable

Hi Mat,

I also tried gang and experimented with num_gangs: 8, 16, 32, … .
We still got the error.

It also didn't help to avoid using the local arrays: ss_l, dd_l, r_bf_l, q_bf_l, mu79_l, nu79_l.

4942, Generating Tesla code
4949, !$acc loop seq

Looks like the outer loop isn't getting parallelized for some reason, so only one gang is being used, which is why setting num_gangs wouldn't have an effect.

I was probably a bit off in my initial assessment, but I do believe the issue has to do with the volume of arrays you're using in the private clause.

Note the following:

When using a "gang" schedule, the compiler will put private arrays in shared memory, which is much faster than using global memory. But here you have around 30 arrays being privatized, for a total of about 220K. Shared memory has a max size of 48K, so what may be happening is that it's shared memory that's running out rather than main memory.
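To put rough numbers on that (my arithmetic, assuming -r8 so reals are 8 bytes): a single MAX_PTS x OP_NUM array is 60 * 26 * 8 = 12480 bytes, about 12K, so just four gang-private arrays of that size are already enough to exhaust the 48K of shared memory.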

Are all these arrays actually used? I only count 7 of them in the code snippet you show.

Granted, you could be running out of main memory. There are several implicit copies as well as an update of several arrays (which implies you have a higher-level data region being used). Though assuming these arrays have about the same sizes as the ones you show, it seems unlikely that you'd be running out of main memory, unless your GPU has a small amount of memory?

As to why the outer loop isn't getting parallelized, I'm not sure. One guess would be if you're passing a module scalar to "compression_lin". In that case, the fix would be to change "compression_lin"'s interface to pass these variables by value via the "value" attribute. Otherwise, I'd need a reproducing example to investigate.

-Mat

Yes. All these arrays are pre-calculated and used in subroutine "compression_lin".
ss, dd, r_bf, q_bf, advfield, izone are local variables declared in subroutine ludefvel_n. Then how to set the "value" attribute?

Here is the code structure:

subroutine ludefall
  do itri=1,numelms
    ! calculate the variables on the private list (involves third-party libraries)
    call ludefvel_n(itri)
  enddo
end subroutine ludefall

subroutine ludefvel_n
  do j=1,dofs_per_element
    call compression_lin
  enddo
end subroutine ludefvel_n

subroutine compression_lin
  do i=1,dofs_per_element
    ! use the variables on the private list to calculate ss, dd, r_bf, q_bf
  enddo
end subroutine compression_lin

– Jin

Then how to set the "value" attribute?

The Fortran 2003 "value" attribute is added to a subroutine argument's declaration. It states that the variable being passed to the subroutine should be passed by value rather than by the default pass-by-reference. Since pass-by-value creates a copy of the variable, as opposed to directly referencing it, there's no possibility of dependent references. Now if the argument does need to be updated, then you wouldn't want to use "value".

Again, I don't know if this is the problem, and it would only be an issue if the passed-in scalar variables were in a module or common block.

It would look something like:

subroutine compression_lin(mu79_l,nu79_l, ss_l,dd_l,r_bf_l,q_bf_l,advfield,izone)
real, dimension(MAX_PTS, OP_NUM) :: mu79_l,nu79_l, ss_l,dd_l,r_bf_l,q_bf_l
real, value :: advfield
integer, value :: izone
…

-Mat

I modified the code to use the "value" attribute, and declared a set of local variables corresponding to all the variables on the private list, such as:

real, dimension(MAX_PTS) :: r_79_l
r_79_l=r_79
!$acc update device(r_79_l, …
!$acc parallel loop num_gangs(64) &
!$acc private(ss_l,dd_l,r_bf_l,q_bf_l,mu79_l,nu79_l, &
!$acc r_79_l, …

But it failed at runtime with the following error message:

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x12bba380 device:0x200093200b00 size:116 presentcount:0+1 line:-1 name:_gyro_21
host:0x13268e80 device:0x200093201400 size:4 presentcount:0+1 line:-1 name:_basic_21
host:0x13a29ae0 device:0x20009c0a5a00 size:576 presentcount:1+0 line:4877 name:r_bf_l
host:0x13a33f20 device:0x20009c0a5e00 size:576 presentcount:1+0 line:4877 name:q_bf_l
host:0x13ab7be0 device:0x20009c0a1e00 size:7488 presentcount:1+0 line:4877 name:ss_l
host:0x13b3d460 device:0x20009c0a3c00 size:7488 presentcount:1+0 line:4877 name:dd_l
host:0x13b3f1a0 device:0x20009c181800 size:12480 presentcount:1+0 line:4889 name:nu79_l
host:0x13b42260 device:0x20009c0a6200 size:898560 presentcount:1+0 line:4889 name:mu79_l
host:0x14123100 device:0x200093200c00 size:4 presentcount:0+1 line:-1 name:_scorec_mesh_mod_16
host:0x14123880 device:0x200093200d00 size:1448 presentcount:0+1 line:-1 name:_nintegrate_16
host:0x14125780 device:0x200093202600 size:208 presentcount:0+1 line:-1 name:_basic_16
host:0x1421c080 device:0x200093202800 size:370080 presentcount:0+1 line:-1 name:_m3dc1_nint_16
host:0x144fcb00 device:0x20009325ce00 size:3360 presentcount:0+1 line:-1 name:_gyroviscosity_16
allocated block device:0x20009c0a1e00 size:7680 thread:1
allocated block device:0x20009c0a3c00 size:7680 thread:1
allocated block device:0x20009c0a5a00 size:1024 thread:1
allocated block device:0x20009c0a5e00 size:1024 thread:1
allocated block device:0x20009c0a6200 size:898560 thread:1
allocated block device:0x20009c181800 size:12800 thread:1

FATAL ERROR: data in update device clause was not found on device 1: name=r_79_l
file:/projects/M3DC1/jinchen/SRC/M3DC1/unstructured.jacc/ludef_t.f90 ludefvel_n line:5011

I don't quite get what it actually says. Do you have any idea?

– Jin

Now the error is fixed. But I'm still getting the same error message: cudaLaunchKernel returned status 2: out of memory.

I'll put together a short code to reproduce the error. Hopefully that will help the debugging.

Thanks,

– Jin

Please do. I'm just making educated guesses without one.

It's very hard to reproduce. I have now moved the j do loop into subroutine compression_lin and enclosed the call to compression_lin in a data region, hoping to avoid the OOM error. Here is what I did:

4875 subroutine ludefvel_n(itri)
4986 !$acc update device(r_79,r2_79,r3_79,ri_79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ri8_79,b2i79,ps179,bz179,ph179,vz179,ch179,pst79,bzt79,pt79,rho79,vis79,vic79,vip79,for79,sir79,bfp079,bfp179,ps079,bz079,ph079,vz079,ch079,pss79,bzs79,bzx79,psx79,bfpx79,pstx79,bztx79,surface_int,weight_79,norm79,npoints,bdf)
4987 !$acc data copyin(mu79,nu79) copyout(ss,dd,r_bf,q_bf)
5000 call compression_lin(mu79,nu79, &
5001 ss,dd,r_bf,q_bf,advfield,izone)
5007 !$acc end data
1111 subroutine compression_lin(trialx, linx, ssterm, ddterm, r_bf, q_bf, advfield, &
1112 izone)
1178 !$acc parallel loop gang &
1179 !$acc private(ssterm,ddterm,q_bf,r_bf,trialx,linx,tempx,trial,lin,temp, &
1180 !$acc ltemp79a, ltemp79b, ltemp79c, ltemp79d, ltemp79e, ltemp79f, &
1181 !$acc r_79,r2_79,r3_79,ri_79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ri8_79,&
1182 !$acc b2i79,ps179,bz179,ph179,vz179,ch179,pst79,bzt79,pt79,rho79,&
1183 !$acc vis79,vic79,vip79,for79,sir79,bfp079,bfp179,ps079,bz079,ph079,vz079,&
1184 !$acc ch079,pss79,bzs79,bzx79,psx79,bfpx79,pstx79,bztx79,&
1185 !$acc weight_79,norm79)
1186 do j=1, dofs_per_element
1590 !$acc loop vector
1591 do i=1, dofs_per_element
1666 end do
1669 enddo

1670 end subroutine compression_lin

But I got the following error message from the compiler:

PGF90-S-0155-Invalid accelerator region: branching into or out of region is not allowed (/projects/M3DC1/jinchen/SRC/M3DC1/unstructured.acc/ludef_t.f90: 1178)
compression_lin:
1178, Invalid accelerator region: branching into or out of region is not allowed
0 inform, 0 warnings, 1 severes, 0 fatal for compression_lin

Do you have any idea why it complains about the parallel do loop?

Thanks,

– Jin

There's something in the code like an 'exit', 'stop', or 'goto' that's causing a branch out of the loop. This isn't allowed since it creates a dependency in the loop.
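A typical restructuring (a sketch with placeholder assignments, not your actual code) is to turn the early exit into a guard so every iteration falls through to the end of the loop:

! Causes "branching out of region":
do i=1,dofs_per_element
   if (izone /= 1) return      ! early exit from inside the compute region
   ss_l(i,:) = 0.
end do

! Allowed: guard the work instead, so control always reaches "end do"
do i=1,dofs_per_element
   if (izone == 1) then
      ss_l(i,:) = 0.
   end if
end do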

Thanks. Indeed we have several "if" constructs with "return"s inside them.

One more question: which part of memory do variables passed as arguments into an OpenACC subroutine occupy, global memory or private memory?

Sorry, but I'm not entirely sure what you're asking. Variables are passed on the stack, but you can pass in global, private, or local variables.

I should have made it clear:

If the GPU subroutine "compression_lin" is called from main, such as

       call compression_lin(mu79,nu79)

       subroutine compression_lin(mu79,nu79)

will mu79 and nu79 reside in private memory or global memory? And onto which part of GPU memory is the subroutine compression_lin offloaded?

Isn't "compression_lin" a host subroutine that contains an OpenACC compute region? They're just host variables at that point.

Though since these are in an update clause, I'm presuming you have a data region at a higher level which creates the device copies of these variables. These copies reside in the device's global memory. When a compute region is encountered, the compiler runtime does a "present" check, which is a table look-up on the host address of the variable to find the corresponding device copy, and then passes that device address to the kernel.
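Schematically (a sketch only; "compute_r" and "work" are placeholders, not your routines or arrays):

!$acc data create(r_79)            ! device copy allocated once, in GPU global memory
r_79 = compute_r()                 ! host copy changes
!$acc update device(r_79)          ! present-table lookup maps the host address of r_79
                                   ! to its device copy, then copies host -> device
!$acc parallel loop copyout(work)
do i=1,npoints
   work(i) = r_79(i)               ! the kernel is launched with the device address
end do                             ! found by the present check
!$acc end data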

Note that "private" memory, i.e. memory only accessible by a single thread in a device kernel, is held in registers or local memory (which is stored in global memory). There's also shared memory, which is located on-chip and shared by all the threads in a CUDA block (gang). For gang-private variables (i.e. variables in a "private" clause on a gang-only loop), the compiler will attempt to store these in shared memory.
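For example (sketch only; "tmp", "s", and "out" are placeholders):

!$acc parallel loop gang private(tmp)   ! gang-private array: candidate for shared memory
do j=1,dofs_per_element
   !$acc loop vector
   do i=1,npoints
      tmp(i) = r_79(i)*j                ! each gang fills its own copy of tmp
   end do
   !$acc loop vector private(s)         ! s is private to each thread: a register, or
   do i=1,npoints                       ! "local" memory (backed by global) if it spills
      s = tmp(i)**2
      out(i,j) = s
   end do
end do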

Thanks for your clear explanation. Now I would like to look into this issue in more detail. What feature in Nsight should I use in order to reveal the memory problem?

Thanks,

– Jin

Nsight Systems and Nsight Compute are profilers, so they wouldn't be helpful for finding a runtime error. Nsight Eclipse Edition is an IDE, so you could use the debugger there, but I personally just use cuda-gdb directly rather than the IDE. Though cuda-gdb probably wouldn't help in tracking this down either.

For this, I'd suggest setting the environment variable "NVCOMPILER_ACC_DEBUG=1" and piping stderr to a file (there can be a lot of output). Then post the last few lines of the file.
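For example (bash syntax; "./m3dc1" stands in for your executable):

export NVCOMPILER_ACC_DEBUG=1
./m3dc1 2> acc_debug.log       # the debug trace is written to stderr
tail -n 50 acc_debug.log       # post roughly this last chunk of the file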