Parallelizing a loop

Hello,

I have a loop parallelized as:

3460
3461 !$acc data copyin(dt, rho_zz_old(1:nVertLevels, 1:nCellsSolve)), &
3462 !$acc copy(qv_old(1:nVertLevels, 1:nCellsSolve)), &
3463 !$acc copy(qc_old(1:nVertLevels, 1:nCellsSolve)), &
3464 !$acc copy(qr_old(1:nVertLevels, 1:nCellsSolve)), &
3465 !$acc copy(qi_old(1:nVertLevels, 1:nCellsSolve)), &
3466 !$acc copy(qs_old(1:nVertLevels, 1:nCellsSolve)), &
3467 !$acc copy(qg_old(1:nVertLevels, 1:nCellsSolve)), &
3468 !$acc copy(qv_tnd(1:nVertLevels, 1:nCellsSolve)), &
3469 !$acc copy(qc_tnd(1:nVertLevels, 1:nCellsSolve)), &
3470 !$acc copy(qr_tnd(1:nVertLevels, 1:nCellsSolve)), &
3471 !$acc copy(qi_tnd(1:nVertLevels, 1:nCellsSolve)), &
3472 !$acc copy(qs_tnd(1:nVertLevels, 1:nCellsSolve)), &
3473 !$acc copy(qg_tnd(1:nVertLevels, 1:nCellsSolve))
3474 !$acc parallel loop gang vector collapse(2) independent private(inv_rho_zz_old)
3475 do iCell = 1, nCellsSolve
3476 do k = 1, nVertLevels
3477 inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)
3478
3479 qv_old(k,iCell) = qv_old(k,iCell)+dt*qv_tnd(k,iCell)*inv_rho_zz_old
3480 qc_old(k,iCell) = qc_old(k,iCell)+dt*qc_tnd(k,iCell)*inv_rho_zz_old
3481 qr_old(k,iCell) = qr_old(k,iCell)+dt*qr_tnd(k,iCell)*inv_rho_zz_old
3482 qi_old(k,iCell) = qi_old(k,iCell)+dt*qi_tnd(k,iCell)*inv_rho_zz_old
3483 qs_old(k,iCell) = qs_old(k,iCell)+dt*qs_tnd(k,iCell)*inv_rho_zz_old
3484 qg_old(k,iCell) = qg_old(k,iCell)+dt*qg_tnd(k,iCell)*inv_rho_zz_old
3485
3486 qv_tnd(k,iCell) = 0.0_RKIND
3487 qc_tnd(k,iCell) = 0.0_RKIND
3488 qr_tnd(k,iCell) = 0.0_RKIND
3489 qi_tnd(k,iCell) = 0.0_RKIND
3490 qs_tnd(k,iCell) = 0.0_RKIND
3491 qg_tnd(k,iCell) = 0.0_RKIND
3492 end do
3493 end do
3494 !$acc end parallel
3495 !$acc end data
3496

It compiled OK, but I have a problem running it.
The message is:

3474: compute region reached 1 time
3474: kernel launched 1 time
grid: [4382] block: [128]
device time(us): total=0 max=0 min=0 avg=0


The MPI run aborted with this error:

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 46871 RUNNING AT htdgfxl07.universalweather.rdn
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES



It runs fine if this loop does not have OpenACC directives.

Any idea?

Thanks,

Wei

Hi Wei,

Unfortunately, there’s not much to go on.

Were there any other messages before the profile? You might try turning off the profile info by not setting PGI_ACC_TIME, to see if that shows the error more clearly.

Another thing to try is setting “PGI_ACC_DEBUG=1”. This will dump information about every call to the OpenACC runtime. It can be a lot of info, but it can be helpful in getting an idea of where the error is coming from.

Once you can get what the actual error is, then we can investigate why it’s occurring.

  • Mat

Mat,

Here is the info with PGI_ACC_DEBUG=1:

src/core_atmosphere/dynamics/mpas_atm_time_integration.F
atm_advance_scalars_mono NVIDIA devicenum=0
time(us): 19,207
3461: data region reached 1 time
3461: data copyin transfers: 27
device time(us): total=19,207 max=1,436 min=44 avg=711
3474: compute region reached 1 time
3474: kernel launched 1 time
grid: [4382] block: [128]
device time(us): total=0 max=0 min=0 avg=0

The function info:

Function 11 = 0x11b1f40 = atm_advance_scalars_mono_3474_gpu
3474 = lineno
128x1x1 = block size
-1x1x1 = grid size
0x1 = config_flag = divx(0)
1x1x1 = unroll
0 = shared memory
0 = reduction shared memory
0 = reduction arg
0 = reduction bytes
336 = argument bytes
0 = max argument bytes
1 = size arguments

pgi_uacc_computestart( file=/home/whuang/pgi/srcs/reorg_mpas_scalars/src/core_atmosphere/dynamics/mpas_atm_time_integration.F, function=atm_advance_scalars_mono, line=3330:3330, line=3474, devid=0 )
pgi_uacc_launch funcnum=11 argptr=0x7ffdd2242320 sizeargs=0x7ffdd2242310 async=-1 devid=1
Arguments to function 11 atm_advance_scalars_mono_3474_gpu dindex=1 threadid=1 device=0:
41 0 161218560 66 71565312 66 36455936 66
66715648 66 47316992 66 81264640 66 76414976 66
0x00000029 0x00000000 0x099c0000 0x00000042 0x04440000 0x00000042 0x022c4600 0x00000042
0x03fa0000 0x00000042 0x02d20000 0x00000042 0x04d80000 0x00000042 0x048e0000 0x00000042
Launch configuration for function=11=atm_advance_scalars_mono_3474_gpu line=3474 dindex=1 threadid=1 device=0 <<<(4382,1,1),(128,1,1),0>>>


At the end it has:

call to cuMemFreeHost returned error 700: Illegal address during kernel execution


Another interesting thing:
I tried changing “parallel” on lines 3474/3494 to “kernels”, and the loop worked.
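That is, lines 3474/3494 become (roughly):

!$acc kernels loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
! ... same loop body as above ...
end do
end do
!$acc end kernels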

Thanks,

Wei

Hi Wei,

I’m curious what the compiler feedback messages are (-Minfo=accel) for both the “parallel” and “kernels” versions.

Also, what happens if you remove “private(inv_rho_zz_old)”? It’s not needed here since scalars are private by default and, more importantly, the scalar will then be declared locally within the kernel rather than in global memory. I doubt it’s the cause of your error, but it’s worth a try.
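Something like this (just a sketch; the loop body stays the same):

!$acc parallel loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)   ! scalar is implicitly private to each iteration
! ... rest of the loop body as before ...
end do
end do
!$acc end parallel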

  • Mat

Mat,

Here are the compiler messages:

  1. with kernels
    atm_advance_scalars_mono:
    3461, Generating copyin(dt,rho_zz_old(1:nvertlevels,1:ncellssolve))
    Generating copy(qv_old(1:nvertlevels,1:ncellssolve),qc_old(1:nvertlevels,1:ncellssolve),qr_old(1:nvertlevels,1:ncellssolve),qi_old(1:nvertlevels,1:ncellssolve),qs_old(1:nvertlevels,1:ncellssolve),qg_old(1:nvertlevels,1:ncellssolve),qv_tnd(1:nvertlevels,1:ncellssolve),qc_tnd(1:nvertlevels,1:ncellssolve),qr_tnd(1:nvertlevels,1:ncellssolve),qi_tnd(1:nvertlevels,1:ncellssolve),qs_tnd(1:nvertlevels,1:ncellssolve),qg_tnd(1:nvertlevels,1:ncellssolve))
    3475, Loop is parallelizable
    3476, Loop is parallelizable
    Accelerator kernel generated
    Generating Tesla code
    3475, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
    3476, ! blockidx%x threadidx%x collapsed

  2. with parallel
    atm_advance_scalars_mono:
    3461, Generating copyin(dt,rho_zz_old(1:nvertlevels,1:ncellssolve))
    Generating copy(qv_old(1:nvertlevels,1:ncellssolve),qc_old(1:nvertlevels,1:ncellssolve),qr_old(1:nvertlevels,1:ncellssolve),qi_old(1:nvertlevels,1:ncellssolve),qs_old(1:nvertlevels,1:ncellssolve),qg_old(1:nvertlevels,1:ncellssolve),qv_tnd(1:nvertlevels,1:ncellssolve),qc_tnd(1:nvertlevels,1:ncellssolve),qr_tnd(1:nvertlevels,1:ncellssolve),qi_tnd(1:nvertlevels,1:ncellssolve),qs_tnd(1:nvertlevels,1:ncellssolve),qg_tnd(1:nvertlevels,1:ncellssolve))
    3474, Accelerator kernel generated
    Generating Tesla code
    3475, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
    3476, ! blockidx%x threadidx%x collapsed

The main difference I see here is that with “parallel” there is a “3474,” before “Accelerator kernel generated”, and the “Loop is parallelizable” messages for lines 3475/3476 don’t appear.


Removing “private(inv_rho_zz_old)” and compiling with “parallel” still did not work.
It works fine when compiled with “kernels”.

Thanks,

Wei

Mat,

Here is another loop, which runs fine without OpenACC directives:
#ifdef _OPENACC
!!acc data copyin(inv_config_dt), &
!!acc copyin(qv_1(1:nVertLevels, 1:nCells)), &
!!acc copyin(qv_2(1:nVertLevels, 1:nCells)), &
!!acc copyout(rqvdynten(1:nVertLevels, 1:nCells))
!!acc parallel loop vector collapse(2) independent
do iCell = 1,nCells
do k = 1,nVertLevels
rqvdynten(k, iCell) = ( qv_2(k, iCell) - qv_1(k, iCell) ) * inv_config_dt
end do
end do
!!acc end parallel
!!acc end data
#else
rqvdynten(:,:) = ( qv_2(:,:) - qv_1(:,:) ) * inv_config_dt
#endif

It compiles fine with either “kernels” or “parallel” when the directives are turned on (changing !!acc to !$acc), but both fail to run.

This loop looks so simple; it really makes me frustrated.

Any suggestions?

Thanks,

Wei

Hi Wei,

I’m going to need a reproducing example. I understand that your code is confidential, but could you try and extract the failing case, obfuscate the code, and send it to PGI Customer Service (trs@pgroup.com)?

I agree that the code does look simple and something that’s been done many times before. Hence, there’s something else going on.

Note that I have seen instances where stomping on memory in one part of the code causes a failure in a completely different section of code. I have no idea if that’s the case here, but one thing to try is to turn off all OpenACC code except this bit and see if it still fails.
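For example, you could use the same trick as the “!!acc” loop you posted above: keep “!$acc” on this one loop and change the sentinel on every other directive to “!!acc”, which the compiler treats as an ordinary comment. A rough sketch (the array name here is just illustrative):

! every other directive temporarily disabled by using "!!acc" instead of "!$acc"
!!acc parallel loop gang vector collapse(2) independent
do iCell = 1, nCells
do k = 1, nVertLevels
some_other_field(k,iCell) = 0.0_RKIND   ! illustrative array; stands in for any other accelerated loop
end do
end do
!!acc end parallel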

  • Mat

Mat,

It is kind of hard to extract a simple reproducing case.

I’ll try what you suggested: just turn on the directives for this loop and see how things go.

Thanks,

Wei

Mat,

I tried to create a short program that has just the loop:

program tst
implicit none

#ifdef SINGLE_PRECISION
integer, parameter :: RKIND = selected_real_kind(6)
#else
integer, parameter :: RKIND = selected_real_kind(12)
#endif

integer, parameter :: nVertLevels = 41
integer, parameter :: nCells = 163842
integer, parameter :: nCellsSolve = 13679

integer :: iCell, k

real (kind=RKIND), dimension(:,:), pointer :: rho_zz_old
real (kind=RKIND), dimension(:,:), pointer :: qv_old, qc_old, qr_old, qi_old, qs_old, qg_old
real (kind=RKIND), dimension(:,:), pointer :: qv_tnd, qc_tnd, qr_tnd, qi_tnd, qs_tnd, qg_tnd
real (kind=RKIND) :: inv_rho_zz_old, dt

allocate(rho_zz_old(nVertLevels, nCells))
allocate(qv_old(nVertLevels, nCells))
allocate(qc_old(nVertLevels, nCells))
allocate(qr_old(nVertLevels, nCells))
allocate(qi_old(nVertLevels, nCells))
allocate(qs_old(nVertLevels, nCells))
allocate(qg_old(nVertLevels, nCells))
allocate(qv_tnd(nVertLevels, nCells))
allocate(qc_tnd(nVertLevels, nCells))
allocate(qr_tnd(nVertLevels, nCells))
allocate(qi_tnd(nVertLevels, nCells))
allocate(qs_tnd(nVertLevels, nCells))
allocate(qg_tnd(nVertLevels, nCells))

dt = 15.0_RKIND

do iCell = 1, nCells
do k = 1, nVertLevels
rho_zz_old(k,iCell) = 10.0_RKIND

qv_old(k,iCell) = 1.0_RKIND
qc_old(k,iCell) = 1.0_RKIND
qr_old(k,iCell) = 1.0_RKIND
qi_old(k,iCell) = 1.0_RKIND
qs_old(k,iCell) = 1.0_RKIND
qg_old(k,iCell) = 1.0_RKIND

qv_tnd(k,iCell) = 1.0_RKIND
qc_tnd(k,iCell) = 1.0_RKIND
qr_tnd(k,iCell) = 1.0_RKIND
qi_tnd(k,iCell) = 1.0_RKIND
qs_tnd(k,iCell) = 1.0_RKIND
qg_tnd(k,iCell) = 1.0_RKIND
end do
end do

!$acc data copyin(dt, rho_zz_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_tnd(1:nVertLevels, 1:nCellsSolve))
!$acc kernels loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)

qv_old(k,iCell) = qv_old(k,iCell)+dt*qv_tnd(k,iCell)*inv_rho_zz_old
qc_old(k,iCell) = qc_old(k,iCell)+dt*qc_tnd(k,iCell)*inv_rho_zz_old
qr_old(k,iCell) = qr_old(k,iCell)+dt*qr_tnd(k,iCell)*inv_rho_zz_old
qi_old(k,iCell) = qi_old(k,iCell)+dt*qi_tnd(k,iCell)*inv_rho_zz_old
qs_old(k,iCell) = qs_old(k,iCell)+dt*qs_tnd(k,iCell)*inv_rho_zz_old
qg_old(k,iCell) = qg_old(k,iCell)+dt*qg_tnd(k,iCell)*inv_rho_zz_old

qv_tnd(k,iCell) = 0.0_RKIND
qc_tnd(k,iCell) = 0.0_RKIND
qr_tnd(k,iCell) = 0.0_RKIND
qi_tnd(k,iCell) = 0.0_RKIND
qs_tnd(k,iCell) = 0.0_RKIND
qg_tnd(k,iCell) = 0.0_RKIND
end do
end do
!$acc end kernels
!$acc end data

!$acc data copyin(dt, rho_zz_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_tnd(1:nVertLevels, 1:nCellsSolve))
!$acc kernels loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)

qv_old(k,iCell) = qv_old(k,iCell)+dt*qv_tnd(k,iCell)*inv_rho_zz_old
qc_old(k,iCell) = qc_old(k,iCell)+dt*qc_tnd(k,iCell)*inv_rho_zz_old
qr_old(k,iCell) = qr_old(k,iCell)+dt*qr_tnd(k,iCell)*inv_rho_zz_old
qi_old(k,iCell) = qi_old(k,iCell)+dt*qi_tnd(k,iCell)*inv_rho_zz_old
qs_old(k,iCell) = qs_old(k,iCell)+dt*qs_tnd(k,iCell)*inv_rho_zz_old
qg_old(k,iCell) = qg_old(k,iCell)+dt*qg_tnd(k,iCell)*inv_rho_zz_old

qv_tnd(k,iCell) = 0.0_RKIND
qc_tnd(k,iCell) = 0.0_RKIND
qr_tnd(k,iCell) = 0.0_RKIND
qi_tnd(k,iCell) = 0.0_RKIND
qs_tnd(k,iCell) = 0.0_RKIND
qg_tnd(k,iCell) = 0.0_RKIND
end do
end do
!$acc end kernels
!$acc end data

end program tst


And compiled with:

pgf90 -acc -Minfo=accel -Mprof=time -r8 -O3 -byteswapio -Mfree tst.F


It works with both “!$acc kernels” and “!$acc parallel”.



Then I went back to my original code and added directives to only this loop.
It failed with “!$acc parallel”, and again worked (and ran fine) with “!$acc kernels”.

It is truly hard to isolate this further, as it sits inside a big code running under MPI (on 12 processes).

I saw that a new version of PGI has come out. Do you think it is a good idea to move to the new version?

Thanks,

Wei

Do you think it is a good idea to move to the new version?

It’s worth a try.

  • Mat