Unknown 8GB memory getting allocated on GPU

Hi,
I am getting an out-of-memory error on the GPU just before one of my GPU kernels is launched. To investigate further, I ran the code with PGI_ACC_DEBUG=1 and found that a memory allocation request for 8GB is made on the device, but I am not sure which variable it corresponds to. The runtime seemed to finish moving an array named neighbours just before the error. Below is an excerpt:

...
pgi_uacc_dataon(devptr=0x60,hostptr=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471x27,extent=18486x27,eltsize=4,lineno=404,name=neighbours,flags=0x2700=create+present+copyin+inexact,threadid=1)
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x7fa1a11a6860
pgi_uacc_alloc(size=1996488,devid=1,threadid=1)
pgi_uacc_alloc(size=1996488,devid=1,threadid=1) returns 0xb02500000
map    dev:0xb02500000 host:0x7fa1a11a6860 size:1996488 offset:0 data[dev:0xb02500000 host:0x7fa1a11a6860 size:1996488] (line:404 name:neighbours) dims=18486x27
alloc done with devptr at 0xb02500000
pgi_uacc_pin(devptr=0x0,hostptr=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471x27,extent=18486x27,eltsize=4,lineno=404,name=neighbours,flags=0x0,threadid=1)
MemHostRegister( 0x7fa1a11a6860, 1996488, 0 )
pgi_uacc_dataupx(devptr=0xb02500000,hostptr=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471x27,extent=18486x27,eltsize=4,lineno=404,name=neighbours,flags=0x0,threadid=1)
pgi_uacc_cuda_dataup2(devdst=0xb02500000,hostsrc=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471,27,eltsize=4,lineno=404,name=neighbours)
pgi_uacc_datadone( async=-1, devid=1 )
pgi_uacc_cuda_wait(lineno=-1,async=-1,dindex=1)
pgi_uacc_cuda_wait(sync on stream=(nil))
pgi_uacc_cuda_wait done
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0246c600
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02700000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0276c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02800000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0286c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02900000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0296c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02a00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02a6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02b00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02b6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02c00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02c6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02d00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02d6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02e00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02e6c400
pgi_uacc_alloc(size=8194917744,devid=1,threadid=1)
call to cuMemAlloc returned error 2: Out of memory
  P0

I ran it step by step under cuda-gdb. I got the memory error again, but no new message shed more light on the cause of the 8GB allocation. Below is an excerpt from cuda-gdb:

[Launch of CUDA Kernel 1 (calc_force_des_150_gpu<<<(18471,1,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (calc_force_des_180_gpu_red<<<(1,1,1),(256,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 1 (calc_force_des_150_gpu<<<(18471,1,1),(256,1,1)>>>) on Device 0]

Breakpoint 1, calc_force_des () at calc_force_des.f:404
404     !$acc parallel
(cuda-gdb) s
__pgi_uacc_dataon (filename=0xae5560 "/lvol/home/anirban/mfix/mfix.0011b/model/./des/calc_force_des.f",
    funcname=0xae55a0 "calc_force_des", pdevptr=0x7fffffffd920, hostptr=0x7fffefd96020, dims=2, desc=0x7fffffffd310, elementsize=8,
    lineno=404, name=0xae59d7 "des_pos_new", flags=9984, async=-1, devid=1) at dataon.c:39
39      dataon.c: No such file or directory.
        in dataon.c
(cuda-gdb) s
43      in dataon.c
(cuda-gdb) s
48      in dataon.c
(cuda-gdb) s
50      in dataon.c
(cuda-gdb) s
52      in dataon.c
(cuda-gdb)
54      in dataon.c
(cuda-gdb)
55      in dataon.c
(cuda-gdb)
56      in dataon.c
(cuda-gdb)
57      in dataon.c
(cuda-gdb)
64      in dataon.c
(cuda-gdb)
66      in dataon.c
(cuda-gdb)
69      in dataon.c
(cuda-gdb)
74      in dataon.c
(cuda-gdb)
77      in dataon.c
(cuda-gdb)
__pgi_uacc_adjust (pdims=0x7fffffffcebc, desc=0x7fffffffd310) at adjust.c:31
31      adjust.c: No such file or directory.
        in adjust.c
(cuda-gdb)
32      in adjust.c
(cuda-gdb)
34      in adjust.c
(cuda-gdb)
36      in adjust.c
(cuda-gdb)
37      in adjust.c
(cuda-gdb)
42      in adjust.c
(cuda-gdb)
43      in adjust.c
(cuda-gdb)
44      in adjust.c
(cuda-gdb)
50      in adjust.c
(cuda-gdb)
56      in adjust.c
(cuda-gdb)
34      in adjust.c
(cuda-gdb)
36      in adjust.c
(cuda-gdb)
37      in adjust.c
(cuda-gdb)
42      in adjust.c
(cuda-gdb)
43      in adjust.c
(cuda-gdb)
44      in adjust.c
(cuda-gdb)
50      in adjust.c
(cuda-gdb)
56      in adjust.c
(cuda-gdb)
34      in adjust.c
(cuda-gdb)
67      in adjust.c
(cuda-gdb)
68      in adjust.c
(cuda-gdb)
78      in adjust.c
(cuda-gdb)
82      in adjust.c
(cuda-gdb)
84      in adjust.c
(cuda-gdb)
85      in adjust.c
(cuda-gdb)
86      in adjust.c
(cuda-gdb)
87      in adjust.c
(cuda-gdb)
67      in adjust.c
(cuda-gdb)
92      in adjust.c
(cuda-gdb)
93      in adjust.c
(cuda-gdb)
94      in adjust.c
(cuda-gdb)
__pgi_uacc_dataon (filename=0xae5560 "/lvol/home/anirban/mfix/mfix.0011b/model/./des/calc_force_des.f",
    funcname=0xae55a0 "calc_force_des", pdevptr=0x7fffffffd920, hostptr=0x7fffefd96020, dims=1, desc=0x7fffffffd310, elementsize=8,
    lineno=404, name=0xae59d7 "des_pos_new", flags=9984, async=-1, devid=1) at dataon.c:78
78      dataon.c: No such file or directory.
        in dataon.c
(cuda-gdb)
87      in dataon.c
(cuda-gdb)
88      in dataon.c
(cuda-gdb)
92      in dataon.c
(cuda-gdb)

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x7fffe8943700 (LWP 14598)]
0x00000036fe8de2f3 in select () from /lib64/libc.so.6
(cuda-gdb)
Single stepping until exit from function select,
which has no line number information.

warning: Cuda API error detected: cuMemAlloc_v2 returned (0x2)

call to cuMemAlloc returned error 2: Out of memory
[Thread 0x7fffe8943700 (LWP 14598) exited]

Program exited with code 01.
[Termination of CUDA Kernel 2 (calc_force_des_180_gpu_red<<<(1,1,1),(256,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 0 (desgrid_neigh_build_gpu_507_gpu<<<(145,1,1),(128,1,1)>>>) on Device 0]
(cuda-gdb)
The program is not being run.

Finally, running the code with cuda-memcheck causes a hang. "ps x" shows the following:

...
14455 pts/0    S+     0:00 cuda-memcheck mfix.exe
14456 pts/0    Rl+   14:17 mfix.exe 
...

Any advice on how to proceed with fixing this would be greatly appreciated.
Thanks much
Anirban

Hi Anirban,

Can you post the OpenACC directives you’re using at line 150 of calc_force_des.f? Are you privatizing any arrays?

My guess is that you’re privatizing a large array, which creates a unique copy for each thread.

  • Mat

Hi Mat,
You are right! I am privatizing some arrays. I have a 3-level nested DO loop: the outer loop is large, being over the number of particles, while the inner ones are much smaller, and I am asking the compiler to execute the two inner loops sequentially. I have posted the code below (it starts at line 404; in fact the loop at line 150 is fine, since the code works when I comment out the ACC directives around the loop starting at line 404):

!$acc data copy(tmp_ax(dimn))
!$acc parallel
!$acc loop  private(LL, PFT_TMP, DIST, DIST_CL, DIST_CI, OMEGA_SUM, VRELTRANS, V_ROT, NORMAL, TANGENT, VSLIP, &
!$acc&              DTSOLID_TMP, DIST_OLD, VRN_OLD, NORMAL_OLD, sigmat, sigmat_old, FNS1, FNS2, FN, norm_old) &
!$acc&              reduction(+: DIST0_EVENTS, NEG_NORM_VEL_1ST_CONTACT_EVENTS, EXCESSIVE_OVERLAP_EVENTS)
! phase_I, phase_LL

      DO LL=1,MAX_PIP
         IF(.NOT.PEA(LL,1) .OR. PEA(LL,4) ) CYCLE

         PFT_TMP(:) = ZERO
         PARTICLE_SLIDE = .FALSE.

! Check particle LL neighbour contacts
!---------------------------------------------------------------------

!$acc loop seq
            DO II = 2, NEIGHBOURS(LL,1)+1
               I = NEIGHBOURS(LL,II)
               IF(PEA(I,1)) THEN
                  ALREADY_IN_CONTACT =.FALSE.

                  DO NEIGH_L = 2, PN(LL,1)+1
                     IF(I.EQ. PN(LL,NEIGH_L)) THEN
                        ALREADY_IN_CONTACT =.TRUE.
                        NI = NEIGH_L
                        EXIT
                     ENDIF
                  ENDDO

......
......

 300           CONTINUE
            ENDDO            ! DO II = 2, NEIGHBOURS(LL,1)+1

!---------------------------------------------------------------------
! End check particle LL neighbour contacts

      ENDDO   ! end loop over particles LL checking particle-particle contact
!$acc end parallel
!$acc end data

I have also attached the relevant compiler output below.

pgf90    -O -Mdalign -acc -ta=nvidia,time -Minfo=inline,accel -Munixlogical -c -I. -Mnosave -Mfreeform -Mrecursive -Mreentrant -byteswapio -Minline=name:des_crossprdct_2d,name:des_crossprdct_3d  ./des/calc_force_des.f 
calc_force_des:
    149, Generating copy(pn(1:max_pip,1:maxneighbors))
         Generating copy(pv(1:max_pip,1:maxneighbors))
         Generating copy(pfn(1:max_pip,1:maxneighbors,1:dimn))
         Generating copy(pft(1:max_pip,1:maxneighbors,1:dimn))
         Generating copy(pea(1:max_pip,1:4))
    150, Accelerator kernel generated
        152, !$acc loop gang ! blockidx%x
        166, !$acc loop vector(256) ! threadidx%x
        168, !$acc loop vector(256) ! threadidx%x
        177, !$acc loop vector(256) ! threadidx%x
        179, !$acc loop vector(256) ! threadidx%x
        186, !$acc loop vector(256) ! threadidx%x
        188, !$acc loop vector(256) ! threadidx%x
        195, !$acc loop vector(256) ! threadidx%x
    150, Generating present_or_copy(pea(1:max_pip,1:4))
         Generating present_or_copy(pn(1:max_pip,1:maxneighbors))
         Generating present_or_copy(pfn(1:max_pip,1:maxneighbors,1:dimn))
         Generating present_or_copy(pv(1:max_pip,1:maxneighbors))
         Generating present_or_copy(pft(1:max_pip,1:maxneighbors,1:dimn))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    152, Scalar last value needed after loop for 'ni' at line 561
         Scalar last value needed after loop for 'ni' at line 644
         Scalar last value needed after loop for 'ni' at line 649
         Scalar last value needed after loop for 'ni' at line 647
         Scalar last value needed after loop for 'ni' at line 515
         Scalar last value needed after loop for 'ni' at line 328
         Scalar last value needed after loop for 'ni' at line 381
         Scalar last value needed after loop for 'ni' at line 387
         Scalar last value needed after loop for 'ni' at line 385
         Scalar last value needed after loop for 'ni' at line 287
    157, Complex loop carried dependence of 'pn' prevents parallelization
         Loop carried dependence due to exposed use of 'pn(i1+1,1:maxneighbors)' prevents parallelization
         Loop carried dependence of 'pn' prevents parallelization
         Loop carried backward dependence of 'pn' prevents vectorization
         Loop carried scalar dependence for 'shift' at line 160
         Loop carried dependence of 'pft' prevents parallelization
         Loop carried backward dependence of 'pft' prevents vectorization
         Loop carried dependence due to exposed use of 'pft(i1+1,1:maxneighbors,:)' prevents parallelization
         Loop carried dependence of 'pfn' prevents parallelization
         Loop carried backward dependence of 'pfn' prevents vectorization
         Loop carried dependence due to exposed use of 'pfn(i1+1,1:maxneighbors,:)' prevents parallelization
    166, Loop is parallelizable
    168, Loop is parallelizable
    176, Loop carried dependence due to exposed use of 'pn(i1+1,1:maxneighbors)' prevents parallelization
    177, Loop is parallelizable
         Loop carried reuse of 'pft' prevents parallelization
    179, Loop is parallelizable
         Loop carried reuse of 'pfn' prevents parallelization
    180, Max reduction generated for neigh_max
    186, Loop is parallelizable
    188, Loop is parallelizable
    195, Loop is parallelizable
    407, Generating copy(tmp_ax(:dimn))
    408, Accelerator kernel generated
        414, !$acc loop gang ! blockidx%x
        417, !$acc loop vector(256) ! threadidx%x
        439, !$acc loop vector(256) ! threadidx%x
        456, !$acc loop vector(256) ! threadidx%x
        468, !$acc loop vector(256) ! threadidx%x
        488, !$acc loop vector(256) ! threadidx%x
        495, !$acc loop vector(256) ! threadidx%x
        501, !$acc loop vector(256) ! threadidx%x
        503, !$acc loop vector(256) ! threadidx%x
        526, !$acc loop vector(256) ! threadidx%x
        528, !$acc loop vector(256) ! threadidx%x
        561, !$acc loop vector(256) ! threadidx%x
        563, !$acc loop vector(256) ! threadidx%x
        577, !$acc loop vector(256) ! threadidx%x
        582, !$acc loop vector(256) ! threadidx%x
        586, !$acc loop vector(256) ! threadidx%x
        593, !$acc loop vector(256) ! threadidx%x
        597, !$acc loop vector(256) ! threadidx%x
        603, !$acc loop vector(256) ! threadidx%x
        607, !$acc loop vector(256) ! threadidx%x
        616, !$acc loop vector(256) ! threadidx%x
        618, !$acc loop vector(256) ! threadidx%x
        622, !$acc loop vector(256) ! threadidx%x
        629, !$acc loop vector(256) ! threadidx%x
        630, !$acc loop vector(256) ! threadidx%x
        641, !$acc loop vector(256) ! threadidx%x
        644, !$acc loop vector(256) ! threadidx%x
        647, !$acc loop vector(256) ! threadidx%x
        649, !$acc loop vector(256) ! threadidx%x
    408, Generating present_or_copyin(des_pos_new(:,:dimn+des_pos_new$sd-1))
         Generating present_or_copyin(pea(:,1:4))
         Generating present_or_copy(pn(1:max_pip,:))
         Generating present_or_copy(fc(1:max_pip,:))
         Generating copyin(fn_tmp(:))
         Generating copyout(fn_tmp(:dimn))
         Generating present_or_copy(tang_old(:3))
         Generating present_or_copy(pft(1:max_pip,:,:))
         Generating present_or_copy(pfn(1:max_pip,:,:))
         Generating present_or_copyin(des_etat(:,:))
         Generating present_or_copyin(des_etan(:,:))
         Generating present_or_copyin(hert_kt(:,:))
         Generating present_or_copyin(hert_kn(:,:))
         Generating present_or_copyin(des_vel_old(:,:dimn+des_vel_old$sd-1))
         Generating present_or_copyin(des_pos_old(:,:dimn+des_pos_old$sd-1))
         Generating present_or_copyin(omega_new(:,:dimn+omega_new$sd-1))
         Generating present_or_copyin(des_radius(:))
         Generating present_or_copyin(des_vel_new(:,:dimn+des_vel_new$sd-1))
         Generating present_or_copy(pv(1:max_pip,:))
         Generating present_or_copyin(pijk(:,5))
         Generating present_or_copy(tmp_ax(:dimn))
         Generating copyin(tang_new(:))
         Generating copyout(tang_new(:3))
         Generating present_or_copy(ft(1:max_pip,:))
         Generating copyin(fts1(:))
         Generating copyout(fts1(:dimn))
         Generating copyin(fts2(:))
         Generating copyout(fts2(:dimn))
         Generating copyin(ft_tmp(:))
         Generating copyout(ft_tmp(:dimn))
         Generating present_or_copy(tow(1:max_pip,:))
         Generating copyin(crossp(:))
         Generating copyout(crossp(:3))
         Generating present_or_copyin(neighbours(1:max_pip,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    414, Accelerator restriction: scalar variable live-out from loop: particle_slide
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_tang
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_norm
         Accelerator restriction: scalar variable live-out from loop: distmod
    417, Loop is parallelizable
    424, Accelerator restriction: scalar variable live-out from loop: particle_slide
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_tang
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_norm
         Accelerator restriction: scalar variable live-out from loop: distmod
    431, Accelerator restriction: induction variable live-out from loop: neigh_l
    436, Accelerator restriction: induction variable live-out from loop: neigh_l
    439, Loop is parallelizable
    444, Max reduction generated for overlap_max
    456, Loop is parallelizable
    468, Loop is parallelizable
    480, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    483, des_crossprdct_2d inlined, size=4, file ./des/calc_force_des.f (805)
    488, Loop is parallelizable
    495, Loop is parallelizable
    501, Loop is parallelizable
    503, Loop is parallelizable
    526, Loop is parallelizable
    528, Loop is parallelizable
    561, Loop is parallelizable
    563, Loop is parallelizable
    574, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    577, Loop is parallelizable
    579, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    581, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    582, Loop is parallelizable
    586, Loop is parallelizable
    593, Loop is parallelizable
    597, Loop is parallelizable
    603, Loop is parallelizable
    607, Loop is parallelizable
    616, Loop is parallelizable
    618, Loop is parallelizable
    622, Loop is parallelizable
    629, Loop is parallelizable
    630, Loop is parallelizable
    635, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    641, Loop is parallelizable
    644, Loop is parallelizable
    647, Loop is parallelizable
    649, Loop is parallelizable
    692, Accelerator kernel generated
        694, !$acc loop gang ! blockidx%x
        702, !$acc loop vector(256) ! threadidx%x
        718, !$acc loop vector(256) ! threadidx%x
        734, !$acc loop vector(256) ! threadidx%x
        752, !$acc loop vector(256) ! threadidx%x
    692, Generating present_or_copyin(des_pos_new(:,:dimn+des_pos_new$sd-1))
         Generating present_or_copyin(des_radius(:))
         Generating present_or_copyin(pea(:,1:4))
         Generating present_or_copy(fcohesive(1:max_pip,1:dimn))
         Generating present_or_copyin(w_pos_l(1:max_pip,1:nwalls,1:dimn))
         Generating copyin(dist(:))
         Generating copyout(dist(:dimn))
         Generating present_or_copyin(neighbours(1:max_pip,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    700, Loop is parallelizable
    702, Loop is parallelizable
    718, Loop is parallelizable
    729, Loop is parallelizable
    734, Loop is parallelizable
    752, Loop is parallelizable

Finally, how should I handle Fortran vector assignments, which I think are treated as tiny (3x1) loops, like

DIST(:) = DES_POS_NEW(I,:) - DES_POS_NEW(LL,:)

Do I have to put

!$acc loop seq

before each of these?
I’m also not clear why the compiler uses vector(256) to parallelize these tiny loops.

I can send the full code by any means you suggest.

Thanks much for your advice.
Anirban

A quick comment: the innermost DO loop is also designated as sequential; I posted a slightly older version of the code above where it isn’t.

!$acc loop seq
                 DO NEIGH_L = 2, PN(LL,1)+1
                     IF(I.EQ. PN(LL,NEIGH_L)) THEN
                        ALREADY_IN_CONTACT =.TRUE.
                        NI = NEIGH_L
                        EXIT
                     ENDIF
                  ENDDO

!$acc loop private(LL, PFT_TMP, DIST, DIST_CL, DIST_CI, OMEGA_SUM, VRELTRANS, V_ROT, NORMAL, TANGENT, VSLIP, &
!$acc& DTSOLID_TMP, DIST_OLD, VRN_OLD, NORMAL_OLD, sigmat, sigmat_old, FNS1, FNS2, FN, norm_old) &

How are these variables declared? Are any of them large arrays?

When you privatize a variable, every thread will get its own copy. Since you have over 18,000 threads, if any of these are large arrays, then you’ll be using up a lot of memory.
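As a back-of-the-envelope model (the array sizes below are made-up illustrations, not taken from your code):

```python
# Rough model of the memory cost of OpenACC "private": assume each
# thread in the launch gets its own copy of every privatized array.
def private_bytes(n_threads, n_arrays, elems_per_array, elt_size):
    """Total device bytes needed for the privatized arrays."""
    return n_threads * n_arrays * elems_per_array * elt_size

# Twenty small 3-element double-precision arrays over ~18,000 threads
# stay cheap:
small = private_bytes(18486, 20, 3, 8)
print(round(small / 1024**2, 1), "MB")   # prints 8.5 MB

# But a single hypothetical 100,000-element double array over the same
# threads would already exceed most GPUs:
big = private_bytes(18486, 1, 100000, 8)
print(round(big / 1024**3, 1), "GB")     # prints 13.8 GB
```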

If your algorithm requires the use of private, then your other option is to use a fixed gang and vector size (i.e. use “num_gangs” and “vector_length”).

  • Mat

Hi Mat,
These variables are all 3x1 arrays of type double, except the loop index LL, which is a scalar. The list contains 20 arrays, and with 18486 copies of each, the total space needed should be a mere ~8.5 MB (= 20*3*8*18486/1024/1024). There are some more large shared arrays, but I have estimated that the total storage requirement should be on the order of tens of MBs, nowhere close to 8GB!

In fact, I have reverse engineered where the big number 8194917744 (~8GB) is coming from: 8194917744 = 18486 × 18471 × 24 !!! So a copy of a 3x1 double array (24 bytes) is apparently being allocated for a grid of 18486 × 18471 threads. But since my inner loops are much smaller (as well as the fact that I have made them sequential), I can’t see what this large grid of threads is associated with. Maybe I am missing something (it is a complex piece of code). Is there a way to generate more illuminating diagnostic messages? Can I get the code to print an error message that says exactly which variable or loop the large memory allocation is associated with?
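The arithmetic of both estimates can be checked directly:

```python
# The private-array estimate: twenty 3x1 double arrays, one copy per
# outer-loop iteration (18486 of them).
estimate = 20 * 3 * 8 * 18486
print(round(estimate / 1024**2, 1), "MB")     # prints 8.5 MB

# The observed allocation factors exactly into the two loop extents
# seen in the debug log times the size of one 3x1 double array (24 B).
observed = 8194917744
assert observed == 18486 * 18471 * 3 * 8
print(round(observed / 1024**3, 1), "GiB")    # prints 7.6 GiB, i.e. ~8 GB
```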

Thanks very much
Anirban

Maybe I am missing something (it is a complex piece of code). Is there a way to generate more illuminating diagnostic messages?

You can set the environment variable “PGI_ACC_DEBUG=1”.

  • Mat

Hi Mat,
I already did that. Excerpts of the PGI_ACC_DEBUG output as well as of the cuda-gdb session are in my original post that started this thread. The issue seems to occur right after the shared array “neighbours” was copied from host to device, but I am not sure whether “neighbours” itself is the issue or something else.

Can you see additional clues in the posted output?

Thanks a lot
Anirban

Hi Anirban,

Are you able to share the code? If I can reproduce the issue, it will give me a much better understanding of the problem.

Though, some other things for you to try: explicitly set the outer loop schedule to “gang, vector”, and possibly use fixed widths (num_gangs, vector_length) to reduce the number of threads. That won’t explain where the 8GB is coming from, but if the 8GB shrinks, then at least we know it’s one of the private arrays.
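To put rough numbers on the fixed-width suggestion (the gang and vector sizes below are hypothetical, assuming one private copy per thread as discussed above):

```python
# With the default schedule the launch uses one gang per outer
# iteration, so the thread count (and hence total private storage)
# grows with the problem size.  Fixed widths cap it.
PER_THREAD_PRIVATE = 20 * 3 * 8          # twenty 3x1 double arrays, in bytes

def private_mb(num_gangs, vector_length):
    """Private-array storage (MB) for a given launch configuration."""
    return num_gangs * vector_length * PER_THREAD_PRIVATE / 1024**2

default_mb = private_mb(18486, 256)      # one gang per particle
fixed_mb   = private_mb(1024, 256)       # hypothetical fixed widths

print(round(default_mb), "MB vs", round(fixed_mb), "MB")  # prints 2166 MB vs 120 MB
```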

Also, try compiling with “-acc=noautopar” so the compiler doesn’t auto-parallelize the inner loops. My thought here is that the auto-parallelization may be what is replicating the private arrays across a much larger grid of threads.

  • Mat

Hi Mat,
Sure, let me know how to share it. The code is MFIX, which you mentioned earlier you have access to, so I can send you just the file calc_force_des.f that I am modifying, along with the input/restart data.

I am trying out the fixed gang/worker/vector configuration. I do think it is one of the private arrays (probably des_pos_new, going by the cuda-gdb output in the first post?). But why it is being copied 18K × 18K times is the mystery I would like to resolve.

Thanks
Anirban

If it can be sent via email, please send it to trs@pgroup.com. Otherwise, ftp it (See: https://www.pgroup.com/support/ftp_access.php).

I do have MFIX here. Looks like I downloaded it last December, but hopefully it is close enough to your version.

  • Mat