Loop optimization question

Hello,

I’m trying to understand the concepts of effective loop parallelization in PGI Accelerator.
I’ve read about the “parallel” and “vector” clauses. If I understand correctly, the “parallel” clause means that the iterations will be executed simultaneously on the accelerator, and the number of concurrently executed iterations cannot be greater than the number of cores on the GPU, right?
The “vector” clause means that the iterations will also be executed simultaneously, but with some synchronization across iterations. Shouldn’t that slow down the computation a bit?
And what does the parameter of the “vector” clause do? It determines “how many iterations are in a vector”, but what does that mean in practice? Can it be larger than the number of GPU cores?

Then, when I try to accelerate a simple loop like:

!$acc region do parallel
      do i=1,n
          a(i) = a(i)+2
      enddo
!$acc end region

(assuming that a is initialized earlier), I get the following compiler message:

“Non-stride-1 access for array a”

Isn’t it a stride-1 access?

I’ve also tested some example code that I found:

c Simple Loop Nest with Poor Cache Use:
!$acc region do parallel
do i=1,n
  do j=1,n
    a(i,j) = b(i,j)
  enddo
enddo
!$acc end region

c Reversed Loop Nest to Achieve Stride-1 Access
!$acc region do parallel
do j=1,n
  do i=1,n
    a(i,j) = b(i,j)
  enddo
enddo
!$acc end region

I get the same “non-stride-1 access” message for both the first and the second loop. I also see that when I don’t put the “parallel” clause in myself, the compiler automatically adds “parallel, vector(…)”. Why?
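
If it helps, this is the kind of explicit schedule I think that feedback is describing; I just picked 256 as the width, it isn’t from any documentation:

c Explicit version of what I believe the compiler chooses on its own
!$acc region
!$acc do parallel
do j=1,n
!$acc do vector(256)
  do i=1,n
    a(i,j) = b(i,j)
  enddo
enddo
!$acc end region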

Hi szczelba,

Your understanding of ‘parallel’ and ‘vector’ as applied to an NVIDIA GPU is a bit off. ‘parallel’ corresponds to the thread blocks, which are scheduled on the streaming multiprocessors, while ‘vector’ corresponds to the threads within a block, which are scheduled on the individual cores. This is a good primer on the NVIDIA threading model and should help.
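
To make that concrete, here is roughly what an explicit schedule for your simple loop could look like; the width of 256 is just an example value, not something you have to use:

! parallel    -> spread the iterations across thread blocks (the grid)
! vector(256) -> 256 iterations per block, one per thread
!$acc region do parallel, vector(256)
      do i=1,n
          a(i) = a(i)+2
      enddo
!$acc end region

Since the threads within a block are time-sliced onto the multiprocessor’s cores, the vector width can be larger than the number of cores; it is usually chosen as a multiple of the warp size (32).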

  • Mat