Pgfortran 20.4 and OpenACC giving "cudaLaunchKernel returned status 2: out of memory"

The following snippet is producing the error message in the title. The code runs successfully on the CPU but not on the GPU. Searching for the message produced almost no results. Does anyone know the solution?

4942 !$acc parallel loop &
4943 !$acc private(ss_l,dd_l,r_bf_l,q_bf_l,mu79_l,nu79_l, &
4944 !$acc
4945 !$acc
4946 !$acc
4947 !$acc ch079,pss79,bzs79,bzx79,psx79,bfpx79,pstx79,bztx79,&
4948 !$acc weight_79,norm79)
4949 do j=1,dofs_per_element
4950 ss_l=0.
4951 dd_l=0.
4952 r_bf_l=0.
4953 q_bf_l=0.
4954 mu79_l=mu79
4955 nu79_l(:,:)=nu79(j,:,:)
4956 call compression_lin(mu79_l,nu79_l, &
4957 ss_l,dd_l,r_bf_l,q_bf_l,advfield,izone)
4960 ss(:,j,:)=ss_l(:,:)
4961 dd(:,j,:)=dd_l(:,:)
4962 r_bf(:,j)=r_bf_l(:)
4963 q_bf(:,j)=q_bf_l(:)
4964 end do

Hi jdh4,

The error means that the device ran out of memory. You do have quite a few private variables, at least some of which appear to be arrays. Keep in mind that every gang or vector lane gets its own private copy of each array. If the arrays are large, this can use up a lot of memory.

What are the sizes of your arrays?
Can you post the output from the compiler feedback messages (i.e. add -Minfo=accel to the compile) so I can see how the compiler is scheduling this loop?

If it’s using a “gang vector” schedule for the outer loop, then you can try using “!$acc parallel loop gang” on the outer loop. This way only each gang gets a private copy of the arrays. (Hopefully the compiler will then auto-parallelize the array syntax, but you may need to make “compression_lin” a vector routine, assuming it contains parallelizable loops, so you don’t lose performance.)

If it’s still too big, you can then fix the number of gangs via the “num_gangs(N)” clause, where “N” is the maximum number of gangs for which the private arrays still fit into memory.
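As a rough sketch of the directive placement (assuming the work inside “compression_lin” contains a parallelizable loop; num_gangs(32) is just an illustrative value):

```fortran
! Gang-only schedule on the outer loop: one private copy of each
! array per gang rather than per vector lane. Tune num_gangs(N)
! so the total private-array footprint fits in memory.
!$acc parallel loop gang num_gangs(32) &
!$acc private(ss_l,dd_l,r_bf_l,q_bf_l,mu79_l,nu79_l)
do j = 1, dofs_per_element
   call compression_lin(mu79_l, nu79_l, ss_l, dd_l, r_bf_l, q_bf_l, &
                        advfield, izone)
end do

! ...and in compression_lin, so its inner loops can use the vector lanes:
!   subroutine compression_lin(...)
!   !$acc routine vector
```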

Also, there seem to be several variables in the private list that aren’t used in the loop. If they aren’t used, they should be removed.

Another possibility is to change the algorithm to use the 3D arrays directly rather than the scratch arrays. For example, get rid of “ss_l” and use “ss” directly.
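Something along these lines (a sketch only; it assumes “compression_lin” can be changed to accept slices of the full arrays):

```fortran
! No per-iteration scratch copies: pass slices of the full result
! arrays so nothing needs to be privatized or copied back per gang.
!$acc parallel loop gang
do j = 1, dofs_per_element
   call compression_lin(mu79, nu79(j,:,:), ss(:,j,:), dd(:,j,:), &
                        r_bf(:,j), q_bf(:,j), advfield, izone)
end do
```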



Hi Mat,

Thanks for your help. Here are our answers to your first two questions:

  1. The size of the arrays
    Most of them are 2D arrays, with OP_NUM = 26:

real, dimension(MAX_PTS) :: r_79, r2_79, r3_79, ri_79, ri2_79, ri3_79, ri4_79, ri5_79, ri6_79, ri7_79, ri8_79
real, dimension(MAX_PTS,2) :: norm79
real, dimension(MAX_PTS, OP_NUM) :: tm79, ni79, b2i79, bi79
real, dimension(MAX_PTS, OP_NUM) :: ps179, bz179, pe179, n179, ph179, vz179, ch179, p179, ne179, pi179
real, dimension(MAX_PTS, OP_NUM) :: pst79, bzt79, pet79, nt79, pht79, vzt79, cht79, pt79, net79
real, dimension(MAX_PTS, OP_NUM) :: rho79, nw79
real, dimension(MAX_PTS, OP_NUM) :: vis79, vic79, vip79, for79, es179
real, dimension(MAX_PTS, OP_NUM) :: jt79, cot79, vot79, pit79, eta79, etaRZ79,sig79, fy79, q79, cd79, totrad79, linerad79, bremrad79, ionrad79, reckrad79, recprad79, sie79, sii79, sir79
real, dimension(MAX_PTS, OP_NUM) :: bfp079, bfp179, bfpt79
real, dimension(MAX_PTS, OP_NUM) :: ps079, bz079, pe079, n079, ph079, vz079, ch079, p079, ne079, pi079
real, dimension(MAX_PTS, OP_NUM) :: pss79, bzs79
real, dimension(MAX_PTS, OP_NUM) :: bzx79, psx79, bfpx79, bfx79, psc79
real, dimension(MAX_PTS, OP_NUM) :: pstx79, bztx79, bfptx79, bftx79
real, dimension(MAX_PTS) :: xi_79, zi_79, eta_79, weight_79
real, dimension(dofs_per_element,num_fields) :: ss_l, dd_l
real, dimension(dofs_per_element) :: r_bf_l, q_bf_l
real, dimension(dofs_per_element, MAX_PTS, OP_NUM) :: mu79_l
real, dimension(MAX_PTS, OP_NUM) :: nu79_l

  2. Output from make
    mpifort -c -r8 -Mpreprocess … -DUSEBLAS -DPETSC_VERSION=990 -DUSEBLAS -fast -Minfo=accel -Mcuda -acc -ta=tesla … ludef_t.f90 -o ludef_t.o
    1111, Generating acc routine seq
    Generating Tesla code
    4855, Generating create(r_bf_l(:),q_bf_l(:),dd_l(:,:),ss_l(:,:)) [if not already present]
    4867, Generating create(nu79_l(:,:),mu79_l(:,:,:)) [if not already present]
    4941, Generating update device(b2i79(:,:),norm79(:,:),for79(:,:),vz179(:,:),weight_79(:),sir79(:,:),bzt79(:,:),pst79(:,:),bfp179(:,:),pstx79(:,:),bztx79(:,:),bz179(:,:),ps179(:,:),bzx79(:,:),vis79(:,:),npoints,bfpx79(:,:),ph179(:,:),bfp079(:,:),bdf,ri_79(:),ch179(:,:),vic79(:,:),surface_int,vip79(:,:),r_79(:),psx79(:,:),bzs79(:,:),pss79(:,:),ch079(:,:),vz079(:,:),ph079(:,:),bz079(:,:),ps079(:,:),ri7_79(:),ri6_79(:),ri5_79(:),ri4_79(:),ri3_79(:),ri2_79(:),rho79(:,:),ri8_79(:),r2_79(:),pt79(:,:),r3_79(:))
    4942, Generating Tesla code
    4949, !$acc loop seq
    4950, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4952, !$acc loop vector(128) ! threadidx%x
    4954, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4955, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4960, !$acc loop seq
    !$acc loop vector(128) ! threadidx%x
    4962, !$acc loop vector(128) ! threadidx%x
    4942, CUDA shared memory used for sir79,ch179,r_79,pt79,for79,norm79,r3_79,q_bf_l,r2_79,ri8_79,rho79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ps079,bz079,ph079,vz079,ch079,pss79,bzs79,psx79,r_bf_l,vip79,ss_l,vic79,dd_l,ri_79,b2i79,bfp079,ph179,bfpx79,nu79_l,vis79,bzx79,ps179,bz179,bztx79,pstx79,bfp179,pst79,bzt79,weight_79,vz179,mu79_l
    Generating implicit copyout(dd(:,:,:)) [if not already present]
    Generating implicit copyin(mu79(:,:,:)) [if not already present]
    Generating implicit copyout(ss(:,:,:),r_bf(:,:)) [if not already present]
    Generating implicit copyin(nu79(:,:,:)) [if not already present]
    Generating implicit copyout(q_bf(:,:)) [if not already present]
    4950, Loop is parallelizable
    4952, Loop is parallelizable
    4954, Loop is parallelizable
    4955, Loop is parallelizable
    4960, Loop is parallelizable
    4962, Loop is parallelizable

Hi Mat,

I also tried gang, and experimented with num_gangs: 8, 16, 32, … .
We still got the error.

It also didn’t help to avoid using the local arrays: ss_l, dd_l, r_bf_l, q_bf_l, mu79_l, nu79_l.

4942, Generating Tesla code
4949, !$acc loop seq

Looks like the outer loop isn’t getting parallelized for some reason, so only one gang is being used, which is why setting num_gangs wouldn’t have an effect.

I was probably a bit off in my initial assessment, but I do believe the issue has to do with the volume of arrays you’re using in the private clause.

Note the following:

When using a “gang” schedule, the compiler will put private arrays in shared memory, which is much faster than global memory. But here you have around 30 arrays being privatized, for a total of about 220K. Shared memory has a max size of 48K, so what may be happening is that it’s the shared memory that’s running out, rather than main memory.

Are all these arrays actually used? I only count 7 of them in the code snippet you show.

Granted, you could be running out of main memory. There are several implicit copies as well as an update of several arrays (which implies that you have a higher-level data region being used). Though assuming that these arrays have about the same sizing as the ones you show, it seems unlikely that you’d be running out of main memory, unless your GPU has a small memory.

As to why the outer loop isn’t getting parallelized, I’m not sure. One guess would be if you’re passing a module scalar to “compression_lin”. In this case, the fix would be to change “compression_lin”'s interface to pass these variables by value via the “value” attribute. Otherwise, I’d need a reproducing example to investigate.


Yes, all these arrays are pre-calculated and used in subroutine “compression_lin”.
ss, dd, r_bf, q_bf, advfield, izone are local variables declared in subroutine ludefvel_n. Then how do we set the “value” attribute?

Here is the code structure:

subroutine ludefall
do itri=1,numelms
calculate the variables on the private list involving third-party libraries
call ludefvel_n(itri)

subroutine ludefvel_n
do j=1,dofs_per_element
call compression_lin

subroutine compression_lin
do i=1,dofs_per_element
use the variables on the private list to calculate ss, dd, r_bf, q_bf

– Jin

Then how do we set the “value” attribute?

The Fortran 2003 “value” attribute is added to a subroutine argument’s declaration. It states that the variable should be passed by value rather than by the default pass by reference. Since pass by value creates a copy of the variable, as opposed to directly referencing it, there’s no possibility of hidden dependencies. If the argument does need to be updated, though, then you wouldn’t want to use “value”.

Again, I don’t know if this is the problem and it would only be an issue if the passed in scalar variables were in a module or common block.

It would look something like:

subroutine compression_lin(mu79_l,nu79_l, ss_l,dd_l,r_bf_l,q_bf_l,advfield,izone)
real, dimension(MAX_PTS, OP_NUM) :: mu79_l,nu79_l, ss_l,dd_l,r_bf_l,q_bf_l
real, value :: advfield
integer, value :: izone


I modified the code to use the “value” attribute, and declared a set of local variables corresponding to all the variables on the private list, such as

real, dimension(MAX_PTS) :: r_79_l
!$acc update device(r_79_l,…
!$acc parallel loop num_gangs(64) &
!$acc private(ss_l,dd_l,r_bf_l,q_bf_l,mu79_l,nu79_l, &
!$acc r_79_l,…

But it failed at runtime with the following error message:

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x12bba380 device:0x200093200b00 size:116 presentcount:0+1 line:-1 name:_gyro_21
host:0x13268e80 device:0x200093201400 size:4 presentcount:0+1 line:-1 name:_basic_21
host:0x13a29ae0 device:0x20009c0a5a00 size:576 presentcount:1+0 line:4877 name:r_bf_l
host:0x13a33f20 device:0x20009c0a5e00 size:576 presentcount:1+0 line:4877 name:q_bf_l
host:0x13ab7be0 device:0x20009c0a1e00 size:7488 presentcount:1+0 line:4877 name:ss_l
host:0x13b3d460 device:0x20009c0a3c00 size:7488 presentcount:1+0 line:4877 name:dd_l
host:0x13b3f1a0 device:0x20009c181800 size:12480 presentcount:1+0 line:4889 name:nu79_l
host:0x13b42260 device:0x20009c0a6200 size:898560 presentcount:1+0 line:4889 name:mu79_l
host:0x14123100 device:0x200093200c00 size:4 presentcount:0+1 line:-1 name:_scorec_mesh_mod_16
host:0x14123880 device:0x200093200d00 size:1448 presentcount:0+1 line:-1 name:_nintegrate_16
host:0x14125780 device:0x200093202600 size:208 presentcount:0+1 line:-1 name:_basic_16
host:0x1421c080 device:0x200093202800 size:370080 presentcount:0+1 line:-1 name:_m3dc1_nint_16
host:0x144fcb00 device:0x20009325ce00 size:3360 presentcount:0+1 line:-1 name:_gyroviscosity_16
allocated block device:0x20009c0a1e00 size:7680 thread:1
allocated block device:0x20009c0a3c00 size:7680 thread:1
allocated block device:0x20009c0a5a00 size:1024 thread:1
allocated block device:0x20009c0a5e00 size:1024 thread:1
allocated block device:0x20009c0a6200 size:898560 thread:1
allocated block device:0x20009c181800 size:12800 thread:1

FATAL ERROR: data in update device clause was not found on device 1: name=r_79_l
file:/projects/M3DC1/jinchen/SRC/M3DC1/unstructured.jacc/ludef_t.f90 ludefvel_n line:5011

I don’t quite get what it actually says. Do you have any idea?

– Jin

Now the error is fixed. But I’m still getting the same error message: cudaLaunchKernel returned status 2: out of memory.

I’ll put together a short code to reproduce the error. Hopefully that will help the debugging.


– Jin

Please do. I’m just making educated guesses without one.

It’s very hard to reproduce. For now I moved the j do loop into subroutine compression_lin and enclosed the call to compression_lin in a data region, hoping to avoid this OOM error. Here is what I did:

4875 subroutine ludefvel_n(itri)
4986 !$acc update device(r_79,r2_79,r3_79,ri_79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ri8_79,b2i79,ps179,bz179,ph179,vz179,ch179,pst79,bzt79,pt79,rho79,vis79,vic79,vip79,for79,sir79,bfp079,bfp179,ps079,bz079,ph079,vz079,ch079,pss79,bzs79,bzx79,psx79,bfpx79,pstx79,bztx79,surface_int,weight_79,norm79,npoints,bdf)
4987 !$acc data copyin(mu79,nu79) copyout(ss,dd,r_bf,q_bf)
5000 call compression_lin(mu79,nu79, &
5001 ss,dd,r_bf,q_bf,advfield,izone)
5007 !$acc end data
1111 subroutine compression_lin(trialx, linx, ssterm, ddterm, r_bf, q_bf, advfield, &
1112 izone)
1178 !$acc parallel loop gang &
1179 !$acc private(ssterm,ddterm,q_bf,r_bf,trialx,linx,tempx,trial,lin,temp, &
1180 !$acc ltemp79a, ltemp79b, ltemp79c, ltemp79d, ltemp79e, ltemp79f, &
1181 !$acc r_79,r2_79,r3_79,ri_79,ri2_79,ri3_79,ri4_79,ri5_79,ri6_79,ri7_79,ri8_79,&
1182 !$acc b2i79,ps179,bz179,ph179,vz179,ch179,pst79,bzt79,pt79,rho79,&
1183 !$acc vis79,vic79,vip79,for79,sir79,bfp079,bfp179,ps079,bz079,ph079,vz079,&
1184 !$acc ch079,pss79,bzs79,bzx79,psx79,bfpx79,pstx79,bztx79,&
1185 !$acc weight_79,norm79)
1186 do j=1, dofs_per_element
1590 !$acc loop vector
1591 do i=1, dofs_per_element
1666 end do
1669 enddo

1670 end subroutine compression_lin

But I got the following error message from compiler:

PGF90-S-0155-Invalid accelerator region: branching into or out of region is not allowed (/projects/M3DC1/jinchen/SRC/M3DC1/unstructured.acc/ludef_t.f90: 1178)
1178, Invalid accelerator region: branching into or out of region is not allowed
0 inform, 0 warnings, 1 severes, 0 fatal for compression_lin

Do you have any idea why it complains about the parallel do loop?


– Jin

There’s something in the code, like an ‘exit’, ‘stop’, or ‘goto’, that’s causing a branch out of the loop. This isn’t allowed since it creates a dependency in the loop.
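For example, a “return” guarded by an “if” inside the loop body can usually be inverted so control never leaves the region (a hypothetical sketch; the “izone” test is only illustrative):

```fortran
! Branches out of the compute region -- not allowed:
!   do i = 1, dofs_per_element
!      if (izone == 0) return
!      ...
!   end do

! Equivalent form that keeps control inside the region:
do i = 1, dofs_per_element
   if (izone /= 0) then
      ! ... loop body ...
   end if
end do
```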

Thanks. Indeed we have several “if” constructs with “return” inside them.

One more question: Which part of memory do variables passed as arguments into an OpenACC subroutine occupy, global memory or private memory?

Sorry, but I’m not entirely sure what you’re asking. Variables are passed on the stack, but you can pass in global, private, or local variables.

I should have made it clear:

If gpu subroutine “compression_lin” is called from main, such as

       call compression_lin(mu79,nu79)

       subroutine compression_lin(mu79,nu79)

Will mu79 and nu79 reside in private memory or global memory? And onto which part of GPU memory is the subroutine compression_lin offloaded?

Isn’t “compression_lin” a host subroutine that contains an OpenACC compute region? They’re just host variables at that point.

Though since these are in an update clause, I’m presuming you have a data region at a higher level that creates the device copies of these variables. These copies reside in the device’s global memory. When a compute region is encountered, the compiler runtime does a “present” check, which performs a table look-up on the host address of each variable to find the corresponding device copy, and then passes the device address to the kernel.

Note that “private” memory, i.e. memory only accessible by a single thread in a device kernel, would be held in registers or local memory (which is stored in global memory). There’s also shared memory, which is located on-chip and shared by all the threads in a CUDA block (gang). For gang-private variables (i.e. variables in a “private” clause on a gang-only loop), the compiler will attempt to store these in shared memory.
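Putting those placement rules together in one sketch (the names a, b, tmp, and n are illustrative, not from your code):

```fortran
!$acc data copyin(a) copyout(b)        ! device copies of a and b: global memory
!$acc parallel loop gang private(tmp)  ! tmp: one copy per gang; the compiler
do j = 1, n                            ! will try to place it in shared memory
   tmp(:) = 2.0 * a(:,j)   ! gang-private scratch, never copied to/from host
   b(:,j) = tmp(:)         ! scalars local to the loop body live in registers
end do
!$acc end data
```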

Thanks for your clear explanation. I would now like to look into this issue in more detail. Which feature in Nsight should I use in order to reveal the memory problem?


– Jin

Nsight Systems and Nsight Compute are profilers, so they wouldn’t be helpful for finding a runtime error. Nsight Eclipse is an IDE, so you could use the debugger in there, but I personally just use cuda-gdb directly rather than the IDE. Though cuda-gdb probably wouldn’t help in tracking this down either.

For this, I’d suggest setting the environment variable “NVCOMPILER_ACC_DEBUG=1” and piping stderr to a file (there can be a lot of output). Then post the last few lines of the file.