Parallelize a Fortran loop

I have a Fortran loop with OpenACC directives added:

!$acc data copyin(cellsOnEdge,h), copyout(h_edge)
!$acc kernels
!$acc loop private(cell1, cell2)
do iEdge=1,nEdges
   cell1 = cellsOnEdge(1,iEdge)
   cell2 = cellsOnEdge(2,iEdge)
   do k=1,nVertLevels
      h_edge(k,iEdge) = 0.5 * (h(k,cell1) + h(k,cell2))
   end do
end do
!$acc end kernels
!$acc end data

If I compile with:

pgf90 -acc -Minfo=accel -Mprof=time -O3 [file.F90]

I get:

28, Generating copyin(cellsonedge(:,:),h(:,:))
Generating copyout(h_edge(:,:))
31, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
31, !$acc loop gang ! blockidx%x
34, !$acc loop vector(64) ! threadidx%x
Loop is parallelizable


But when the same code is part of a larger multi-file program compiled with MPI (mpif90), I get:

3986, Generating copyin(cellsonedge(:,:),h(:,:))
Generating copyout(h_edge(:,:))
3989, Complex loop carried dependence of cellsonedge,nvertlevels,h prevents parallelization
Loop carried dependence of h_edge prevents parallelization
Loop carried backward dependence of h_edge prevents vectorization
Complex loop carried dependence of h_edge prevents parallelization
Accelerator kernel generated
Generating Tesla code


Also, the isolated test program takes more time with the GPU than without it.

Thanks for your help.

Wei

Hi Wei,

What type is “h_edge”? My best guess is that in some spots “h_edge” is a pointer and in others it’s an allocatable. If this is the case, add the “independent” clause to your loop directive.
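For example, on the outer loop you posted, it would look something like:

!$acc loop independent private(cell1, cell2)
do iEdge=1,nEdges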

What are the sizes of “nEdges” and “nVertLevels”?

Since the compiler is using a vector length of 64, my assumption is that nVertLevels is a known value and relatively small.

Please set the environment variable “PGI_ACC_TIME=1”, run your program, and post the output. This will show basic profile information about how much time each kernel is taking as well as the amount of time it takes to copy data and give us a better idea on how to improve the performance.

  • Mat

I followed Mat’s suggestion and modified the loop as:

4027 !
4028 !$acc data copyin(cellsOnEdge,h), copyout(h_edge)
4029 !$acc kernels
4030 !$acc loop independent private(cell1, cell2)
4031 do iEdge=1,nEdges
4032 cell1 = cellsOnEdge(1,iEdge)
4033 cell2 = cellsOnEdge(2,iEdge)
4034 do k=1,nVertLevels
4035 h_edge(k,iEdge) = 0.5 * (h(k,cell1) + h(k,cell2))
4036 end do
4037 end do
4038 !$acc end kernels
4039 !$acc end data
4040

Here nEdges is 40961 and nVertLevels is 41.
The values in cellsOnEdge range from 1 to 13653 (which is nCells);
h is a pointer to an array of dimension(nVertLevels, nCells),
and h_edge is a pointer to an array of dimension(nVertLevels, nEdges).

(In my simplified test code, h and h_edge are plain arrays rather than pointers.)

But we still get the dependence message:
4028, Generating copyin(cellsonedge(:,:),h(:,:))
Generating copyout(h_edge(:,:))
4031, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
4031, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
4034, Complex loop carried dependence of h,h_edge prevents parallelization
Inner sequential loop scheduled on accelerator


Thanks,

Wei

4031, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
4034, Complex loop carried dependence of h,h_edge prevents parallelization
Inner sequential loop scheduled on accelerator

Yes, this is because you would also need to add a “loop independent” directive on the “k” loop to tell the compiler that the loop doesn’t have any dependencies.
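That is, something like this on the inner loop:

!$acc loop independent
do k=1,nVertLevels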

However, given that “nEdges” is very large and “nVertLevels” is small, you might want to try a few schedules to see which one works best.

First, time the current schedule of “gang vector(128)” on just the iEdge loop. The one problem here is that “iEdge” isn’t the index of the stride-1 dimension, so you might have inefficient memory access. To help with this, try adding “INTENT(IN)” to cellsOnEdge and h (assuming they are passed as arguments) so they will be placed in texture memory. Alternatively, if you can change your data layout, reorder the arrays so that “iEdge” and “cell1” index the leading dimension.
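As a rough sketch of the INTENT(IN) idea, assuming the arrays come in as dummy arguments (the routine name and the kind-type module are placeholders for whatever your code actually uses):

subroutine compute_h_edge(nEdges, nVertLevels, cellsOnEdge, h, h_edge)
   ! Hypothetical wrapper routine; name, argument list, and kind module are illustrative only.
   use mpas_kind_types, only : RKIND        ! adjust to wherever RKIND is defined in your model
   implicit none
   integer, intent(in) :: nEdges, nVertLevels
   integer, dimension(:,:), intent(in) :: cellsOnEdge   ! intent(in) marks these as read-only,
   real (kind=RKIND), dimension(:,:), intent(in) :: h   ! which lets the compiler use texture loads
   real (kind=RKIND), dimension(:,:), intent(out) :: h_edge
   integer :: iEdge, k, cell1, cell2

!$acc data copyin(cellsOnEdge,h), copyout(h_edge)
!$acc kernels
!$acc loop gang vector(128) independent private(cell1, cell2)
   do iEdge=1,nEdges
      cell1 = cellsOnEdge(1,iEdge)
      cell2 = cellsOnEdge(2,iEdge)
      do k=1,nVertLevels
         h_edge(k,iEdge) = 0.5 * (h(k,cell1) + h(k,cell2))
      end do
   end do
!$acc end kernels
!$acc end data
end subroutine compute_h_edge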

Another experiment is to add “!$acc loop vector(32) independent” to the “k” loop, but also add “worker(4)” to the “iEdge” loop.

4029 !$acc kernels 
4030 !$acc loop gang worker(4) independent
4031 do iEdge=1,nEdges 
4032 cell1 = cellsOnEdge(1,iEdge) 
4033 cell2 = cellsOnEdge(2,iEdge) 
4034 !$acc loop vector(32) independent        
4035 do k=1,nVertLevels

With this, you’ll now have stride-1 memory access, but some threads will be idle since nVertLevels is not divisible by 32. Also, there’s not a lot of computation, so there’s not much for each thread to do.

Finally, I’d also run the experiment with 2 workers and a vector length of 64.
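In directive form, that variant would look something like this (same loop body as above):

!$acc kernels
!$acc loop gang worker(2) independent private(cell1, cell2)
do iEdge=1,nEdges
   cell1 = cellsOnEdge(1,iEdge)
   cell2 = cellsOnEdge(2,iEdge)
!$acc loop vector(64) independent
   do k=1,nVertLevels
      h_edge(k,iEdge) = 0.5 * (h(k,cell1) + h(k,cell2))
   end do
end do
!$acc end kernels

Both variants keep 128 threads per block (4x32 and 2x64); with nVertLevels at 41, vector(64) covers the k loop in one pass but leaves 23 lanes idle.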

Let me know how your experiment goes. Be sure to use the output from PGI_ACC_TIME to see the performance.

  • Mat

Mat,

h and h_edge are pointers:

real (kind=RKIND), dimension(:,:), pointer :: h_edge, h
integer, dimension(:,:), pointer :: cellsOnEdge

As this is not my code and it is part of a big model, I won’t be able to switch the dimension order.

I tried with code:

4027 !
4028 !$acc data copyin(cellsOnEdge,h), copyout(h_edge)
4029 !$acc kernels
4030 !$acc loop gang worker(4) independent, &
4031 !$acc private(cell1, cell2)
4032 do iEdge=1,nEdges
4033 cell1 = cellsOnEdge(1,iEdge)
4034 cell2 = cellsOnEdge(2,iEdge)
4035 !$acc loop vector(64) independent
4036 do k=1,nVertLevels
4037 h_edge(k,iEdge) = 0.5 * (h(k,cell1) + h(k,cell2))
4038 end do
4039 end do
4040 !$acc end kernels
4041 !$acc end data
4042


Now it compiles:

4028, Generating copyin(cellsonedge(:,:),h(:,:))
Generating copyout(h_edge(:,:))
4032, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
4032, !$acc loop gang, worker(4) ! blockidx%x threadidx%y
4036, !$acc loop vector(64) ! threadidx%x
Loop is parallelizable


Here are the runtime messages:

launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4452 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4461 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4470 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4480 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4489 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_compute_solve_diagnostics line=4032 device=0 threadid=1 num_gangs=11021 num_workers=4 vector_length=64 grid=11021 block=64x4 shared memory=64

Accelerator Kernel Timing data
/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F
atm_compute_solve_diagnostics NVIDIA devicenum=0
time(us): 4,123
4028: data region reached 1 time
4028: data copyin transfers: 5
device time(us): total=4,123 max=2,328 min=15 avg=824
4029: compute region reached 1 time
4032: kernel launched 1 time
grid: [11021] block: [64x4]
device time(us): total=0 max=0 min=0 avg=0
/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F
atm_init_coupled_diagnostics NVIDIA devicenum=0
time(us): 19,726
4447: data region reached 1 time
4447: data copyin transfers: 19
device time(us): total=11,301 max=1,699 min=7 avg=594
4449: compute region reached 1 time
4452: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=577 max=577 min=577 avg=577
4458: compute region reached 1 time
4461: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=375 max=375 min=375 avg=375
4467: compute region reached 1 time
4470: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=2,029 max=2,029 min=2,029 avg=2,029
4477: compute region reached 1 time
4480: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=4,620 max=4,620 min=4,620 avg=4,620
4486: compute region reached 1 time
4489: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=719 max=719 min=719 avg=719
4496: data region reached 1 time
4496: data copyout transfers: 5
device time(us): total=8,425 max=1,774 min=1,443 avg=1,685
call to cuMemFreeHost returned error 700: Illegal address during kernel execution


Wei

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

This means that there’s some memory access issue (similar to a segmentation violation on the CPU).

It could be an out-of-bounds error in your code. Or, since these are pointers, you may need to include the sizes of the arrays in the data copy clauses, for example:

4028 !$acc data copyin(cellsOnEdge(:2,:nEdges), h(:nVertLevels,:nCells)), copyout(h_edge(:nVertLevels,:nEdges))

Does the error occur with other schedules or just the 4x64?

  • Mat

Mat,

It failed with 2x32 as well.

Changed the code to:

4027 !
4028 !$acc data copyin(cellsOnEdge(1:2, 1:nEdges), h(1:nVertLevels, 1:nCells+1)), &
4029 !$acc copyout(h_edge(1:nVertLevels, 1:nEdges))
4030 !$acc kernels
4031 !$acc loop gang worker(4) independent, &
4032 !$acc private(cell1, cell2)
4033 do iEdge=1,nEdges
4034 cell1 = cellsOnEdge(1,iEdge)
4035 cell2 = cellsOnEdge(2,iEdge)
4036 !$acc loop vector(64) independent
4037 do k=1,nVertLevels
4038 h_edge(k,iEdge) = 0.5 * (h(k,cell1) + h(k,cell2))
4039 end do
4040 end do
4041 !$acc end kernels
4042 !$acc end data
4043


But I still get the error:

launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4457 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4466 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4475 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4485 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_init_coupled_diagnostics line=4494 device=0 threadid=1 num_gangs=4660 num_workers=1 vector_length=128 grid=4660 block=128
launch CUDA kernel file=/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F function=atm_compute_solve_diagnostics line=4033 device=0 threadid=1 num_gangs=11021 num_workers=4 vector_length=64 grid=11021 block=64x4 shared memory=64

Accelerator Kernel Timing data
/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F
atm_compute_solve_diagnostics NVIDIA devicenum=0
time(us): 3,425
4028: data region reached 1 time
4028: data copyin transfers: 5
device time(us): total=3,425 max=2,592 min=44 avg=685
4030: compute region reached 1 time
4033: kernel launched 1 time
grid: [11021] block: [64x4]
device time(us): total=0 max=0 min=0 avg=0
/home/whuang/pgi/srcs/MPAS-Release-4.0/src/core_atmosphere/dynamics/mpas_atm_time_integration.F
atm_init_coupled_diagnostics NVIDIA devicenum=0
time(us): 25,351
4452: data region reached 1 time
4452: data copyin transfers: 19
device time(us): total=10,223 max=1,746 min=44 avg=538
4454: compute region reached 1 time
4457: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=9,768 max=9,768 min=9,768 avg=9,768
4463: compute region reached 1 time
4466: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=3,075 max=3,075 min=3,075 avg=3,075
4472: compute region reached 1 time
4475: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=5,493 max=5,493 min=5,493 avg=5,493
4482: compute region reached 1 time
4485: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=11,285 max=11,285 min=11,285 avg=11,285
4491: compute region reached 1 time
4494: kernel launched 1 time
grid: [4660] block: [128]
elapsed time(us): total=2,676 max=2,676 min=2,676 avg=2,676
4501: data region reached 1 time
4501: data copyout transfers: 5
device time(us): total=15,128 max=9,131 min=1,496 avg=3,025
call to cuMemFreeHost returned error 700: Illegal address during kernel execution


Thanks,

Wei

Could it be an out-of-bounds access? Have you double-checked that all the values of “cell1” and “cell2” lie within the range of 1 to nCells+1?
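One way to check is a plain host-side loop before the accelerated region, something along these lines:

do iEdge=1,nEdges
   cell1 = cellsOnEdge(1,iEdge)
   cell2 = cellsOnEdge(2,iEdge)
   if (cell1 < 1 .or. cell1 > nCells+1 .or. &
       cell2 < 1 .or. cell2 > nCells+1) then
      write(*,*) 'bad cellsOnEdge at iEdge =', iEdge, cell1, cell2
   end if
end do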

Try running your host-only code through Valgrind (www.valgrind.org) to see if there are any memory issues.

Also, you can try running the OpenACC version through cuda-memcheck to see what errors it sees.

The CUDA 7.5 cuda-gdb does have some limited support for OpenACC, so you can try compiling your code with “-g” and then running it through the debugger. This might indicate the exact line where the crash is occurring.

Finally, you can set the environment variable “PGI_ACC_DEBUG” to have the PGI runtime show you every device call.

Hope this helps,
Mat