 # best strategy / HowTo for nested loops

Hi,

we want to accelerate a CFD code with the PGI Accelerator model.
Unfortunately, I couldn't find a HowTo, or extract a strategy for our code,
from the various documentation. Hopefully someone can help me reduce
my confusion ;-).

So, here are the facts:

• we have loops over the 3 dimensions of our grid, I,J,K, each with a length
of about 4 to 100 (the loop lengths are not known at compile time)

• We use Fortran with the PGI Accelerator model (11.5)

• We have four C2050 cards

a code without dependencies on neighbouring cells would probably look like

``````
      do K = 1, Kend
         do J = 1, Jend
            do I = 1, Iend
               rhoV2(I,J,K) = ( urho(I,J,K,MM)*urho(I,J,K,MM)
     &                        + vrho(I,J,K,MM)*vrho(I,J,K,MM)
     &                        + wrho(I,J,K,MM)*wrho(I,J,K,MM) )
     &                        / rho(I,J,K,MM)
            end do
         end do
      end do

      do is = 1, ns
         do K = 1, Kend
            do J = 1, Jend
               do I = 1, Iend
                  bet1(I,J,K) = bet1(I,J,K)
     &                        + xi(I,J,K,is,MM)*Rho(I,J,K,MM)*Cvtr1(is)
               end do
            end do
         end do
      end do
``````

a code with dependencies on neighbouring cells would probably look like

``````
      do is = 1, nh
         do K = 2, Kre2(mm)-1
            do J = 2, Jre2(mm)-1
               do I = 2, Ire2(mm)-1

                  dxdt(i,j,k,is) = dxdt(i,j,k,is)
     &               + XI_flux_rho_l(I,J,K,MM)
     &               * xi(i_lft(I,J,K,MM),J,K,is,MM)
     &               + XI_flux_rho_r(I,J,K,MM)
     &               * xi(i_rght(I,J,K,MM),J,K,is,MM)
     &               + ET_flux_rho_l(I,J,K,MM)
     &               * xi(I,j_lft(I,J,K,MM),K,is,MM)
     &               + ET_flux_rho_r(I,J,K,MM)
     &               * xi(I,j_rght(I,J,K,MM),K,is,MM)
     &               + ZE_flux_rho_l(I,J,K,MM)
     &               * xi(I,J,k_lft(I,J,K,MM),is,MM)
     &               + ZE_flux_rho_r(I,J,K,MM)
     &               * xi(I,J,k_rght(I,J,K,MM),is,MM)

               enddo
            enddo
         enddo
      enddo
``````

where i_lft etc. is the index of either the left neighbour or the cell itself, depending on the fluxes.

Now, here are the questions:

1. Is there a tutorial / HowTo / … that explains the best strategy for
parallelization with several nested loops?

2. does the first or the last index of an array have to correspond to the innermost loop?

3. what is a good strategy for parallelizing the I,J,K loops (the
computational grid) with respect to multiprocessors and thread processors?

4. is the “vector” length the length of the pieces of the I-loop that are computed sequentially on one core? And are Iend/length cores then running in parallel for the I-loop?

5. The grid block size: does it make sense to use small grid blocks that
fit in the cache, or is it better to have big blocks in all 3 dimensions? Or
is it better to have long slices, with I large compared to J and K -> a long
vector?

6. Does it make sense to have as much code as possible within the innermost loop, or is it better to split the code into separate loops, all over I,J,K, with only one or two statements in each innermost loop?

OK, I know that's a lot of questions, so: many thanks in advance for any hint!

jockel,

To the best of our knowledge, there is no tutorial/HowTo that explains the best strategy.

We have found that performance is usually better when making the “I” loop the
“vector” loop. Make the thread blocks have dimensions like (256,1,1), which correspond
to the “vector” loop. Then make the “J” and “K” loops the parallel loops.
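Roughly, that schedule could be sketched with the accelerator loop directives like this (a sketch against the 11.x directive spellings, applied to the first example loop; check the compiler's -Minfo=accel feedback for the schedule it actually generates):

``````
!$acc region
!$acc do parallel                  ! K spread across thread blocks
      do K = 1, Kend
!$acc do parallel                  ! J spread across thread blocks
         do J = 1, Jend
!$acc do vector(256)               ! I as the stride-1 "vector" dimension
            do I = 1, Iend
               rhoV2(I,J,K) = ( urho(I,J,K,MM)*urho(I,J,K,MM)
     &                        + vrho(I,J,K,MM)*vrho(I,J,K,MM)
     &                        + wrho(I,J,K,MM)*wrho(I,J,K,MM) )
     &                        / rho(I,J,K,MM)
            end do
         end do
      end do
!$acc end region
``````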

regards,
dave

Hi Jockel,

1. Is there a tutorial / HowTo / … that explains the best strategy for
parallelization with several nested loops?

What might be useful for you is to take a look at our paper and source code for the Himeno benchmark which you can find in the PGI Accelerator Files Page (http://www.pgroup.com/resources/accel_files/index.htm) near the bottom. Himeno is a 3D Poisson solver and has a similar structure to your code.

2. does the first index of the variable or the last have to be the innermost loop?

The first index should correspond to the “vector” in order to get stride-1 memory accesses across threads. Typically, the innermost loop does correspond to at least one dimension of the vector.

3. what is a good strategy for parallelization of the I,J,K loops (the
computational grid) with respect to multiprocessors and thread processors?

I would first let the compiler determine the schedule and then start making adjustments from there. Though, it's been my experience that the compiler usually does find the optimal schedule, and I end up removing my directives. Most likely, it will try to create a 16x16 vector (thread block) from the I and J loops and then create a 2-D parallel grid (of blocks) from K and J.
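Written out explicitly, that 16x16 schedule might look roughly like the sketch below (directive spellings from the 11.x model; A, B and C are placeholder arrays for illustration, and -Minfo=accel will report whatever schedule is really used):

``````
!$acc region
!$acc do parallel                   ! K -> one grid dimension (blocks)
      do K = 1, Kend
!$acc do parallel, vector(16)       ! J -> blocks of 16 threads
         do J = 1, Jend
!$acc do vector(16)                 ! I -> 16 threads, stride-1
            do I = 1, Iend
               A(I,J,K) = B(I,J,K) + C(I,J,K)
            end do
         end do
      end do
!$acc end region
``````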

Note, for the second loop I would move the “is” loop to the innermost position and run it sequentially in your kernel. As it is now, it's not parallel, since every “is” iteration updates the same bet1(I,J,K). Something like:

``````
      do K = 1, Kend
         do J = 1, Jend
!$acc do kernel
            do I = 1, Iend
               do is = 1, ns
                  bet1(I,J,K) = bet1(I,J,K)
     &                        + xi(I,J,K,is,MM)*Rho(I,J,K,MM)*Cvtr1(is)
               end do
            end do
         end do
      end do
``````

Note that for the third example loop you do not want to move “is” to the innermost position, since there “is” adds a fourth dimension to the array being updated.

4. is “vector” length the length of the parts of the (I)-loop, that are computed sequentially on one core? And Iend/length cores are running in parallel then for the I-loop?

No. On an NVIDIA GPU, “vector” corresponds to the number of threads in a thread block. I recommend reading this article on the CUDA Data Parallel Model (http://www.pgroup.com/lit/articles/insider/v2n1a5.htm).

5. The grid block size: does it make sense to make small grid blocks that
fit in the cache or is it better to have big blocks in all 3 dimensions? Or
is it better to have long slices with large I compared to J and K -> long
vector?
6. Does it make sense to have as much code as possible within the innermost loop or is it better to divide the code and have separate loops, all over I,J,K with only one or two statements in the innermost loops?

I’m going to lump these two questions together. The question of large kernels versus small kernels is not easily answered, since it is very dependent on your algorithm.

Ideally, you want to achieve 100% “occupancy” (http://forums.nvidia.com/index.php?showtopic=31279), which is the ratio of active warps to the maximum number supported by your GPU. The limiting factors are the amount of shared memory available, the number of registers, and the number of threads per block. Smaller kernels tend to use fewer resources, so you can increase the number of threads per block and thereby the occupancy. However, small kernels also tend to mean that you need to run more of them in order to implement your algorithm.

Larger kernels tend to use more resources, so you cannot run as many threads per block, which reduces the occupancy. However, doing more work per kernel means fewer kernel launches and less overhead.

Hope this helps,
Mat