# four demanding loops

Hi there,

I have three quite demanding nested loops (pid2, pid3, and pid4), each running up to ~2000 iterations (counter1), plus a loop inside a subroutine which runs from 1 to 7:

``````   !$acc parallel loop reduction(+:erg,counter4) private(DDDD_4T_temp(1:7)) collapse(2)
do pid2 = 1, counter1
do pid3 = 1, counter1
x2 = part_pos_rc1(1,pid2)
y2 = part_pos_rc1(2,pid2)
z2 = part_pos_rc1(3,pid2)

r12x = x2-x1
r12y = y2-y1
r12z = z2-z1
r12_2= r12x*r12x + r12y*r12y + r12z*r12z
r12  = sqrt(r12_2)

!do pid3 = 1, counter1  ...not here due to collapse(2)

if (pid3==pid2) cycle

x3 = part_pos_rc1(1,pid3)
y3 = part_pos_rc1(2,pid3)
z3 = part_pos_rc1(3,pid3)

r23x = x3 - x2
r23y = y3 - y2
r23z = z3 - z2

r23_2 = r23x*r23x + r23y*r23y + r23z*r23z

if (r23_2 .lt. rc2) then

r23  = sqrt(r23_2)

!$acc loop independent
do pid4 = 1, counter1

if (pid4==pid3 .or. pid4==pid2) cycle

r24x = part_pos_rc1(1,pid4) - x2
r24y = part_pos_rc1(2,pid4) - y2
r24z = part_pos_rc1(3,pid4) - z2

r24_2 = r24x*r24x + r24y*r24y + r24z*r24z

if (r24_2 .ge. rc2) cycle

r34x = part_pos_rc1(1,pid4) - x3
r34y = part_pos_rc1(2,pid4) - y3
r34z = part_pos_rc1(3,pid4) - z3

r34_2 = r34x*r34x + r34y*r34y + r34z*r34z

if (r34_2 .lt. rc2) then

r34  = sqrt(r34_2)

r41x = x1 - part_pos_rc1(1,pid4)
r41y = y1 - part_pos_rc1(2,pid4)
r41z = z1 - part_pos_rc1(3,pid4)

r41_2= r41x*r41x + r41y*r41y + r41z*r41z
r41  = sqrt(r41_2)

! calculates CCCC_4T_temp(1:7),...,DDDD_4u5T_temp(1:7) using
! !$acc routine vector   within the subroutine "potentials", with
! !$acc loop vector      before the actual loop (do i = 1, 7  ...)
call ptt  (r12,r23,r34,r41,  &
r12_2,r23_2,r34_2,r41_2,&
r12x, r23x, r34x, r41x, &
r12y, r23y, r34y, r41y, &
r12z, r23z, r34z, r41z, &
CCCC_4T_temp(1:7), &
DDDD_4T_temp(1:7), DDDD_5T_temp(1:7), DDDD_4u5T_temp(1:7) )

counter4 = counter4 + 1

! I use erg to test this parallelization..
erg = erg + DDDD_4T_temp(7)

!         ..but I'm actually interested in these arrays:
!              CCCC_4T(:)   = CCCC_4T(:) + CCCC_4T_temp(:)
!              DDDD_4T(:)   = DDDD_4T(:) + DDDD_4T_temp(:)
!              DDDD_5T(:)   = DDDD_5T(:)   + DDDD_5T_temp(:)
!              DDDD_4u5T(:) = DDDD_4u5T(:) + DDDD_4u5T_temp(:)

end if

end do !pid4

end if

end do ! pid3
end do ! pid2
!$acc end parallel
``````

Furthermore, here is some runtime launch information from the compiler:

launch CUDA kernel:
line=591 device=0 threadid=1 num_gangs=65535 num_workers=1 vector_length=32 grid=65535 block=32 shared memory=2048
launch CUDA kernel:
line=591 device=0 threadid=1 num_gangs=4 num_workers=1 vector_length=256 grid=4 block=256 shared memory=2048

Here is also the output of the visual profiler. The main problem I have is a very low speed-up… Do you have any ideas to improve the efficiency?

Thank you very much in advance!

Try making “ptt” a sequential routine (“routine seq”) and then scheduling the counter1 loops as “gang vector”. You’re losing a lot of performance since only 1 out of 128 vectors is doing useful work. Plus you’re only using 7 vectors in “ptt”.
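In directive form, the suggestion would look roughly like this (a sketch only — the argument list and loop bodies are abbreviated, and the exact clauses depend on your code):

``````! Declare ptt as a sequential device routine instead of "routine vector":
subroutine ptt( ... )      ! argument list abbreviated
!$acc routine seq
  ...
end subroutine ptt

! At the call site, give the full vector width to the particle loops:
!$acc parallel loop gang vector collapse(2) reduction(+:erg,counter4) &
!$acc&  private(CCCC_4T_temp,DDDD_4T_temp,DDDD_5T_temp,DDDD_4u5T_temp)
do pid2 = 1, counter1
  do pid3 = 1, counter1
    ...
    !$acc loop seq
    do pid4 = 1, counter1
      ...
      call ptt( ... )      ! now runs sequentially within each thread
    end do
  end do
end do
``````

This way every thread of the vector executes a full pid4 iteration, instead of one thread doing the work while the rest of the vector idles waiting for the 7-iteration inner loop.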

• Mat

Hi Mat,

thanks for the reply! I have changed my subroutine “ptt” to “routine seq”. Furthermore, I restructured my counter1 loops to:

``````!$acc kernels
!$acc loop gang reduction(+:erg,counter4) private(DDDD_4T_temp)
do pid2 = 1, counter1
!$acc loop worker
do pid3 = 1, counter1
!$acc loop vector
do pid4 = 1, counter1
``````

This leads to the following mapping:
num_gangs=19504886 num_workers=4 vector_length=32 grid=43x1346x337 block=32x4 shared memory=2048

The computational time dropped to ~25% of that of the non-parallelized program… but don’t you think that there is much more potential in there?

many thanks!

… I have also tried this way

``````    !$acc loop gang vector reduction(+:erg,counter4) private(DDDD_4T_temp)
do pid2 = 1, counter1
do pid3 = 1, counter1
do pid4 = 1, counter1
``````

…which leads to the same mapping and computational time as in the example before.

The next schedule I’d try is a collapse(3).

``````    !$acc loop gang vector reduction(+:erg,counter4) private(DDDD_4T_temp)  collapse(3)
do pid2 = 1, counter1
do pid3 = 1, counter1
do pid4 = 1, counter1
``````

Other things to look at:

“part_pos_rc1” is not being accessed contiguously across the “vectors”, which will cause some memory divergence. The best fix would be to switch how you’re indexing “part_pos_rc1” so that “pid” is the first (fastest-varying) index and 1, 2, or 3 is the second. For example: “part_pos_rc1(pid4,1)”. Alternatively, since the array is read-only, you can decorate it with “INTENT(IN)” (assuming it’s an argument) and the compiler will try to put the array in texture memory.
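A minimal sketch of the transposed layout (the array bound “nmax” is a placeholder):

``````! Original layout: coordinate (1..3) is the fast index
!   real :: part_pos_rc1(3, nmax)
! Transposed layout: particle id is the fast index, so consecutive
! threads (pid4, pid4+1, ...) read consecutive memory locations:
real :: part_pos_rc1(nmax, 3)
...
r24x = part_pos_rc1(pid4,1) - x2
r24y = part_pos_rc1(pid4,2) - y2
r24z = part_pos_rc1(pid4,3) - z2
``````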

Given the number of local scalar variables, your register usage may be high, thus lowering your occupancy. Compile with “-ta=tesla:ptxinfo” to see the register usage. If it’s above 32 registers per thread, then you’re losing occupancy. Granted, a high occupancy does not guarantee good performance, and 50% or above is generally considered good. You can use the flag “-ta=tesla:maxregcount:n” to limit the number of registers per thread and increase the occupancy. However, local variables will then spill. If they only spill to the L1 cache, then you’re fine. But if they spill to global memory, then your performance will suffer. It may take some experimentation to find the optimal number of registers.
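For reference, the compile lines might look like this (the source file name and the register limit of 32 are placeholders to adjust for your build):

``````# Show per-kernel register usage reported by ptxas
pgfortran -acc -ta=tesla:ptxinfo -Minfo=accel main.f90 -o main

# Cap registers per thread (here: 32) to try to raise occupancy
pgfortran -acc -ta=tesla:maxregcount:32 -Minfo=accel main.f90 -o main
``````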

• Mat