Parallelizing a loop

Hello,

I have a loop parallelized as:

3460
3461 !$acc data copyin(dt, rho_zz_old(1:nVertLevels, 1:nCellsSolve)), &
3462 !$acc copy(qv_old(1:nVertLevels, 1:nCellsSolve)), &
3463 !$acc copy(qc_old(1:nVertLevels, 1:nCellsSolve)), &
3464 !$acc copy(qr_old(1:nVertLevels, 1:nCellsSolve)), &
3465 !$acc copy(qi_old(1:nVertLevels, 1:nCellsSolve)), &
3466 !$acc copy(qs_old(1:nVertLevels, 1:nCellsSolve)), &
3467 !$acc copy(qg_old(1:nVertLevels, 1:nCellsSolve)), &
3468 !$acc copy(qv_tnd(1:nVertLevels, 1:nCellsSolve)), &
3469 !$acc copy(qc_tnd(1:nVertLevels, 1:nCellsSolve)), &
3470 !$acc copy(qr_tnd(1:nVertLevels, 1:nCellsSolve)), &
3471 !$acc copy(qi_tnd(1:nVertLevels, 1:nCellsSolve)), &
3472 !$acc copy(qs_tnd(1:nVertLevels, 1:nCellsSolve)), &
3473 !$acc copy(qg_tnd(1:nVertLevels, 1:nCellsSolve))
3474 !$acc parallel loop gang vector collapse(2) independent private(inv_rho_zz_old)
3475 do iCell = 1, nCellsSolve
3476 do k = 1, nVertLevels
3477 inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)
3478
3479 qv_old(k,iCell) = qv_old(k,iCell)+dt*qv_tnd(k,iCell)*inv_rho_zz_old
3480 qc_old(k,iCell) = qc_old(k,iCell)+dt*qc_tnd(k,iCell)*inv_rho_zz_old
3481 qr_old(k,iCell) = qr_old(k,iCell)+dt*qr_tnd(k,iCell)*inv_rho_zz_old
3482 qi_old(k,iCell) = qi_old(k,iCell)+dt*qi_tnd(k,iCell)*inv_rho_zz_old
3483 qs_old(k,iCell) = qs_old(k,iCell)+dt*qs_tnd(k,iCell)*inv_rho_zz_old
3484 qg_old(k,iCell) = qg_old(k,iCell)+dt*qg_tnd(k,iCell)*inv_rho_zz_old
3485
3486 qv_tnd(k,iCell) = 0.0_RKIND
3487 qc_tnd(k,iCell) = 0.0_RKIND
3488 qr_tnd(k,iCell) = 0.0_RKIND
3489 qi_tnd(k,iCell) = 0.0_RKIND
3490 qs_tnd(k,iCell) = 0.0_RKIND
3491 qg_tnd(k,iCell) = 0.0_RKIND
3492 end do
3493 end do
3494 !$acc end parallel
3495 !$acc end data
3496

It compiled OK, but I have a problem running it.
The message is:

3474: compute region reached 1 time
3474: kernel launched 1 time
grid: [4382] block: [128]
device time(us): total=0 max=0 min=0 avg=0


The MPI run aborted with this error:

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 46871 RUNNING AT htdgfxl07.universalweather.rdn
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES



It runs fine if this loop does not have OpenACC directives.

Any idea?

Thanks,

Wei

Hi Wei,

Unfortunately, there’s not much to go on.

Were there any other messages before the profile? You might try turning off the profile info by not setting PGI_ACC_TIME, to see if that shows the error more clearly.

Another thing to try is setting “PGI_ACC_DEBUG=1”. This will dump information about every call to the OpenACC runtime. It can be a lot of info, but it can be helpful in getting an idea of where the error is coming from.

Once you can get what the actual error is, then we can investigate why it’s occurring.

  • Mat

Mat,

Here is the info with PGI_ACC_DEBUG=1:

src/core_atmosphere/dynamics/mpas_atm_time_integration.F
atm_advance_scalars_mono NVIDIA devicenum=0
time(us): 19,207
3461: data region reached 1 time
3461: data copyin transfers: 27
device time(us): total=19,207 max=1,436 min=44 avg=711
3474: compute region reached 1 time
3474: kernel launched 1 time
grid: [4382] block: [128]
device time(us): total=0 max=0 min=0 avg=0

The function info:

Function 11 = 0x11b1f40 = atm_advance_scalars_mono_3474_gpu
3474 = lineno
128x1x1 = block size
-1x1x1 = grid size
0x1 = config_flag = divx(0)
1x1x1 = unroll
0 = shared memory
0 = reduction shared memory
0 = reduction arg
0 = reduction bytes
336 = argument bytes
0 = max argument bytes
1 = size arguments

pgi_uacc_computestart( file=/home/whuang/pgi/srcs/reorg_mpas_scalars/src/core_atmosphere/dynamics/mpas_atm_time_integration.F, function=atm_advance_scalars_mono, line=3330:3330, line=3474, devid=0 )
pgi_uacc_launch funcnum=11 argptr=0x7ffdd2242320 sizeargs=0x7ffdd2242310 async=-1 devid=1
Arguments to function 11 atm_advance_scalars_mono_3474_gpu dindex=1 threadid=1 device=0:
41 0 161218560 66 71565312 66 36455936 66
66715648 66 47316992 66 81264640 66 76414976 66
0x00000029 0x00000000 0x099c0000 0x00000042 0x04440000 0x00000042 0x022c4600 0x00000042
0x03fa0000 0x00000042 0x02d20000 0x00000042 0x04d80000 0x00000042 0x048e0000 0x00000042
Launch configuration for function=11=atm_advance_scalars_mono_3474_gpu line=3474 dindex=1 threadid=1 device=0 <<<(4382,1,1),(128,1,1),0>>>


At the end it has:

call to cuMemFreeHost returned error 700: Illegal address during kernel execution


Another interesting thing:
I tried changing “parallel” on lines 3474/3494 to “kernels”, and the loop worked.
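That is, lines 3474/3494 become (roughly):

!$acc kernels loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
! ... same loop body as above ...
end do
end do
!$acc end kernels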

Thanks,

Wei

Hi Wei,

I’m curious what the compiler feedback messages are (-Minfo=accel) for both the “parallel” and “kernels” versions.

Also, what happens if you remove “private(inv_rho_zz_old)”? It’s not needed here since scalars are private by default and, more importantly, the scalar will then be declared locally within the kernel rather than in global memory. I doubt it’s the cause of your error, but it’s worth a try.
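Something like this (just a sketch; the loop body stays the same):

!$acc parallel loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)   ! scalar is implicitly private to each iteration
! ... rest of the loop body as before ...
end do
end do
!$acc end parallel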

  • Mat

Mat,

Here are the compiler messages:

  1. with kernels
    atm_advance_scalars_mono:
    3461, Generating copyin(dt,rho_zz_old(1:nvertlevels,1:ncellssolve))
    Generating copy(qv_old(1:nvertlevels,1:ncellssolve),qc_old(1:nvertlevels,1:ncellssolve),qr_old(1:nvertlevels,1:ncellssolve),qi_old(1:nvertlevels,1:ncellssolve),qs_old(1:nvertlevels,1:ncellssolve),qg_old(1:nvertlevels,1:ncellssolve),qv_tnd(1:nvertlevels,1:ncellssolve),qc_tnd(1:nvertlevels,1:ncellssolve),qr_tnd(1:nvertlevels,1:ncellssolve),qi_tnd(1:nvertlevels,1:ncellssolve),qs_tnd(1:nvertlevels,1:ncellssolve),qg_tnd(1:nvertlevels,1:ncellssolve))
    3475, Loop is parallelizable
    3476, Loop is parallelizable
    Accelerator kernel generated
    Generating Tesla code
    3475, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
    3476, ! blockidx%x threadidx%x collapsed

  2. with parallel
    atm_advance_scalars_mono:
    3461, Generating copyin(dt,rho_zz_old(1:nvertlevels,1:ncellssolve))
    Generating copy(qv_old(1:nvertlevels,1:ncellssolve),qc_old(1:nvertlevels,1:ncellssolve),qr_old(1:nvertlevels,1:ncellssolve),qi_old(1:nvertlevels,1:ncellssolve),qs_old(1:nvertlevels,1:ncellssolve),qg_old(1:nvertlevels,1:ncellssolve),qv_tnd(1:nvertlevels,1:ncellssolve),qc_tnd(1:nvertlevels,1:ncellssolve),qr_tnd(1:nvertlevels,1:ncellssolve),qi_tnd(1:nvertlevels,1:ncellssolve),qs_tnd(1:nvertlevels,1:ncellssolve),qg_tnd(1:nvertlevels,1:ncellssolve))
    3474, Accelerator kernel generated
    Generating Tesla code
    3475, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
    3476, ! blockidx%x threadidx%x collapsed

The main difference I see here is that with “parallel” there is a “3474,” before “Accelerator kernel generated”, and the “Loop is parallelizable” messages for lines 3475/3476 don’t appear.


Removing “private(inv_rho_zz_old)” and compiling with “parallel” still did not work.
It works fine when compiled with “kernels”.

Thanks,

Wei

Mat,

Here is another loop, which runs fine without OpenACC directives:
#ifdef _OPENACC
!!acc data copyin(inv_config_dt), &
!!acc copyin(qv_1(1:nVertLevels, 1:nCells)), &
!!acc copyin(qv_2(1:nVertLevels, 1:nCells)), &
!!acc copyout(rqvdynten(1:nVertLevels, 1:nCells))
!!acc parallel loop vector collapse(2) independent
do iCell = 1,nCells
do k = 1,nVertLevels
rqvdynten(k, iCell) = ( qv_2(k, iCell) - qv_1(k, iCell) ) * inv_config_dt
end do
end do
!!acc end parallel
!!acc end data
#else
rqvdynten(:,:) = ( qv_2(:,:) - qv_1(:,:) ) * inv_config_dt
#endif

It compiles fine with either “kernels” or “parallel” when the directives are turned on (changing !!acc to !$acc), but both fail to run.

This loop looks so simple; it really makes me frustrated.

Any suggestions?

Thanks,

Wei

Hi Wei,

I’m going to need a reproducing example. I understand that your code is confidential, but could you try and extract the failing case, obfuscate the code, and send it to PGI Customer Service (trs@pgroup.com)?

I agree that the code does look simple and something that’s been done many times before. Hence, there’s something else going on.

Note that I have seen instances where stomping on memory in one part of the code causes a failure in a completely different section of code. I have no idea if that’s the case here, but one thing to try is to turn off all OpenACC code except this bit and see if it still fails.
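For example, you could use the same trick as the “!!acc” loop you posted above: keep “!$acc” on this one loop and change the sentinel on every other directive to “!!acc”, which the compiler treats as an ordinary comment. A rough sketch (the array name here is just illustrative):

! every other directive temporarily disabled by using "!!acc" instead of "!$acc"
!!acc parallel loop gang vector collapse(2) independent
do iCell = 1, nCells
do k = 1, nVertLevels
some_other_field(k,iCell) = 0.0_RKIND   ! illustrative array; stands in for any other accelerated loop
end do
end do
!!acc end parallel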

  • Mat

Mat,

It is kind of hard to extract a simple reproducing case.

I’ll try what you suggested: just turn on the directives for this loop and see how things go.

Thanks,

Wei

Mat,

I tried to create a short program that has just the loop:

program tst
implicit none

#ifdef SINGLE_PRECISION
integer, parameter :: RKIND = selected_real_kind(6)
#else
integer, parameter :: RKIND = selected_real_kind(12)
#endif

integer, parameter :: nVertLevels = 41
integer, parameter :: nCells = 163842
integer, parameter :: nCellsSolve = 13679

integer :: iCell, k

real (kind=RKIND), dimension(:,:), pointer :: rho_zz_old
real (kind=RKIND), dimension(:,:), pointer :: qv_old, qc_old, qr_old, qi_old, qs_old, qg_old
real (kind=RKIND), dimension(:,:), pointer :: qv_tnd, qc_tnd, qr_tnd, qi_tnd, qs_tnd, qg_tnd
real (kind=RKIND) :: inv_rho_zz_old, dt

allocate(rho_zz_old(nVertLevels, nCells))
allocate(qv_old(nVertLevels, nCells))
allocate(qc_old(nVertLevels, nCells))
allocate(qr_old(nVertLevels, nCells))
allocate(qi_old(nVertLevels, nCells))
allocate(qs_old(nVertLevels, nCells))
allocate(qg_old(nVertLevels, nCells))
allocate(qv_tnd(nVertLevels, nCells))
allocate(qc_tnd(nVertLevels, nCells))
allocate(qr_tnd(nVertLevels, nCells))
allocate(qi_tnd(nVertLevels, nCells))
allocate(qs_tnd(nVertLevels, nCells))
allocate(qg_tnd(nVertLevels, nCells))

dt = 15.0_RKIND

do iCell = 1, nCells
do k = 1, nVertLevels
rho_zz_old(k,iCell) = 10.0_RKIND

qv_old(k,iCell) = 1.0_RKIND
qc_old(k,iCell) = 1.0_RKIND
qr_old(k,iCell) = 1.0_RKIND
qi_old(k,iCell) = 1.0_RKIND
qs_old(k,iCell) = 1.0_RKIND
qg_old(k,iCell) = 1.0_RKIND

qv_tnd(k,iCell) = 1.0_RKIND
qc_tnd(k,iCell) = 1.0_RKIND
qr_tnd(k,iCell) = 1.0_RKIND
qi_tnd(k,iCell) = 1.0_RKIND
qs_tnd(k,iCell) = 1.0_RKIND
qg_tnd(k,iCell) = 1.0_RKIND
end do
end do

!$acc data copyin(dt, rho_zz_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_tnd(1:nVertLevels, 1:nCellsSolve))
!$acc kernels loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)

qv_old(k,iCell) = qv_old(k,iCell)+dt*qv_tnd(k,iCell)*inv_rho_zz_old
qc_old(k,iCell) = qc_old(k,iCell)+dt*qc_tnd(k,iCell)*inv_rho_zz_old
qr_old(k,iCell) = qr_old(k,iCell)+dt*qr_tnd(k,iCell)*inv_rho_zz_old
qi_old(k,iCell) = qi_old(k,iCell)+dt*qi_tnd(k,iCell)*inv_rho_zz_old
qs_old(k,iCell) = qs_old(k,iCell)+dt*qs_tnd(k,iCell)*inv_rho_zz_old
qg_old(k,iCell) = qg_old(k,iCell)+dt*qg_tnd(k,iCell)*inv_rho_zz_old

qv_tnd(k,iCell) = 0.0_RKIND
qc_tnd(k,iCell) = 0.0_RKIND
qr_tnd(k,iCell) = 0.0_RKIND
qi_tnd(k,iCell) = 0.0_RKIND
qs_tnd(k,iCell) = 0.0_RKIND
qg_tnd(k,iCell) = 0.0_RKIND
end do
end do
!$acc end kernels
!$acc end data

!$acc data copyin(dt, rho_zz_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_old(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qv_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qc_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qr_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qi_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qs_tnd(1:nVertLevels, 1:nCellsSolve)), &
!$acc copy(qg_tnd(1:nVertLevels, 1:nCellsSolve))
!$acc kernels loop gang vector collapse(2) independent
do iCell = 1, nCellsSolve
do k = 1, nVertLevels
inv_rho_zz_old = 1.0_RKIND / rho_zz_old(k,iCell)

qv_old(k,iCell) = qv_old(k,iCell)+dt*qv_tnd(k,iCell)*inv_rho_zz_old
qc_old(k,iCell) = qc_old(k,iCell)+dt*qc_tnd(k,iCell)*inv_rho_zz_old
qr_old(k,iCell) = qr_old(k,iCell)+dt*qr_tnd(k,iCell)*inv_rho_zz_old
qi_old(k,iCell) = qi_old(k,iCell)+dt*qi_tnd(k,iCell)*inv_rho_zz_old
qs_old(k,iCell) = qs_old(k,iCell)+dt*qs_tnd(k,iCell)*inv_rho_zz_old
qg_old(k,iCell) = qg_old(k,iCell)+dt*qg_tnd(k,iCell)*inv_rho_zz_old

qv_tnd(k,iCell) = 0.0_RKIND
qc_tnd(k,iCell) = 0.0_RKIND
qr_tnd(k,iCell) = 0.0_RKIND
qi_tnd(k,iCell) = 0.0_RKIND
qs_tnd(k,iCell) = 0.0_RKIND
qg_tnd(k,iCell) = 0.0_RKIND
end do
end do
!$acc end kernels
!$acc end data

end program tst


And compiled with:

pgf90 -acc -Minfo=accel -Mprof=time -r8 -O3 -byteswapio -Mfree tst.F


It works with both “!$acc kernels” and “!$acc parallel”.



Then I went back to my original code and added directives to only this loop.
It failed with “!$acc parallel”, and again worked (and ran fine) with “!$acc kernels”.

It is truly hard to isolate this further, as it sits inside a big code running under MPI (on 12 processes).

I saw that a new version of PGI has come out. Do you think it is a good idea to move to the new version?

Thanks,

Wei

Do you think it is a good idea to move to the new version?

It’s worth a try.

  • Mat