Hi,
we want to accelerate a CFD code with the PGI Accelerator model.
Unfortunately, I couldn't find a HowTo, nor could I extract a strategy for
our code from the various documents. Hopefully someone can help me reduce
my confusion ;).
So, here are the facts:

we have loops over the 3 dimensions of our grid, I, J, K, each about 4 to
100 in length. (The loop lengths are not known at compile time.)
We use Fortran with the PGI Accelerator compiler (version 11.5).

We have four C2050 cards
Code without dependencies on neighbouring cells would look like this:
      do K = 1, Kend
        do J = 1, Jend
          do I = 1, Iend
            rhoV2(I,J,K) = ( urho(I,J,K,MM)*urho(I,J,K,MM)
     &                     + vrho(I,J,K,MM)*vrho(I,J,K,MM)
     &                     + wrho(I,J,K,MM)*wrho(I,J,K,MM) )
     &                     / rho(I,J,K,MM)
          end do
        end do
      end do

      do is = 1, ns
        do K = 1, Kend
          do J = 1, Jend
            do I = 1, Iend
              bet1(I,J,K) = bet1(I,J,K)
     &                    + xi(I,J,K,is,MM)*Rho(I,J,K,MM)*Cvtr1(is)
            end do
          end do
        end do
      end do
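To make the question concrete: for a nest like the first one above, the directive placement I am considering looks like this (the parallel/vector mapping and the vector length of 64 are guesses on my part, and exactly what I would like advice on):

```fortran
!$acc region
!$acc do parallel
      do K = 1, Kend
        do J = 1, Jend
!$acc do vector(64)
          do I = 1, Iend
            rhoV2(I,J,K) = ( urho(I,J,K,MM)*urho(I,J,K,MM)
     &                     + vrho(I,J,K,MM)*vrho(I,J,K,MM)
     &                     + wrho(I,J,K,MM)*wrho(I,J,K,MM) )
     &                     / rho(I,J,K,MM)
          end do
        end do
      end do
!$acc end region
```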
Code with dependencies on neighbouring cells would look like this:
      do is = 1, nh
        do K = 2, Kre2(mm)-1
          do J = 2, Jre2(mm)-1
            do I = 2, Ire2(mm)-1
              dxdt(I,J,K,is) = dxdt(I,J,K,is)
     &          + XI_flux_rho_l(I,J,K,MM)
     &          * ( xi(i_lft(I,J,K,MM),J,K,is,MM) )
     &          + XI_flux_rho_r(I,J,K,MM)
     &          * ( xi(i_rght(I,J,K,MM),J,K,is,MM) )
     &          + ET_flux_rho_l(I,J,K,MM)
     &          * ( xi(I,j_lft(I,J,K,MM),K,is,MM) )
     &          + ET_flux_rho_r(I,J,K,MM)
     &          * ( xi(I,j_rght(I,J,K,MM),K,is,MM) )
     &          + ZE_flux_rho_l(I,J,K,MM)
     &          * ( xi(I,J,k_lft(I,J,K,MM),is,MM) )
     &          + ZE_flux_rho_r(I,J,K,MM)
     &          * ( xi(I,J,k_rght(I,J,K,MM),is,MM) )
            end do
          end do
        end do
      end do
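For this second nest I suspect the compiler cannot prove on its own that the indirect indices (i_lft, j_rght, ...) are safe, even though each iteration only writes its own dxdt(I,J,K,is). So I am wondering whether asserting independence is the right approach; a sketch of what I mean (again just a guess, only the first two flux terms shown):

```fortran
!$acc region
      do is = 1, nh
!$acc do independent, parallel
        do K = 2, Kre2(mm)-1
          do J = 2, Jre2(mm)-1
!$acc do vector(64)
            do I = 2, Ire2(mm)-1
              dxdt(I,J,K,is) = dxdt(I,J,K,is)
     &          + XI_flux_rho_l(I,J,K,MM)
     &          * ( xi(i_lft(I,J,K,MM),J,K,is,MM) )
     &          + XI_flux_rho_r(I,J,K,MM)
     &          * ( xi(i_rght(I,J,K,MM),J,K,is,MM) )
!             ... remaining four flux terms as above ...
            end do
          end do
        end do
      end do
!$acc end region
```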
where i_lft etc. give the index of the left neighbour or of the cell itself, depending on the fluxes.
Now, here are the questions:

Is there a tutorial / HowTo / … that explains the best parallelization
strategy for several nested loops?
Does the first or the last index of an array have to belong to the innermost loop?

What is a good strategy for parallelizing the I,J,K loops (the
computational grid) with respect to multiprocessors and thread processors?
Is the "vector" length the length of the chunks of the I loop that are computed sequentially on one core, so that Iend/length cores then run in parallel for the I loop?

The grid block size: does it make sense to use small grid blocks that
fit into the cache, or is it better to have big blocks in all 3 dimensions? Or
is it better to have long slices with I large compared to J and K, i.e. a long
vector?
Does it make sense to put as much code as possible into the innermost loop, or is it better to split the work into separate loops, each over I,J,K, with only one or two statements in the innermost loop?
OK, I know those are a lot of questions, so: many thanks in advance for any hint!