I’m trying to understand the concepts of effective loop parallelization in PGI Accelerator.
I read about the “parallel” and “vector” clauses. If I understand correctly, the “parallel” clause means that the iterations will be executed simultaneously on the accelerator. The number of concurrently executed iterations cannot be greater than the number of cores on the GPU, right?
The “vector” clause means that the iterations will be executed simultaneously but with synchronization, so there will be some synchronization across iterations. Shouldn’t that slow down the computation a bit?
What is the parameter of the “vector” clause? It determines “how many iterations are in a vector”, but what does that mean? Can it be larger than the number of GPU cores?
Then, when I try to accelerate a simple loop like:
!$acc region do parallel
      do i=1,n
         a(i) = a(i)+2
      enddo
!$acc end region
(assuming that a is initialized earlier), I get the following message:
“Non-stride-1 access for array a”
Isn’t it a stride-1 access?
I’ve also tested some example code that I found:
c Simple Loop Nest with Poor Cache Use:
!$acc region do parallel
      do i=1,n
         do j=1,n
            a(i,j) = b(i,j)
         enddo
      enddo
!$acc end region

c Reversed Loop Nest to Achieve Stride-1 Access
!$acc region do parallel
      do j=1,n
         do i=1,n
            a(i,j) = b(i,j)
         enddo
      enddo
!$acc end region
I get the same “non-stride-1 access” message for both the first and the second loop nest. I also see that when I leave out the “parallel” clause, the compiler automatically adds “parallel, vector(…)”. Why?
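To make clear what I mean by stride-1, here is a minimal host-only sketch of how I understand Fortran's column-major layout. It uses no accelerator directives. The program name, the helper function `offset`, and the size `n = 4` are my own illustrative choices and do not come from the PGI example. It is written in free-form Fortran.

```fortran
program stride_check
  implicit none
  integer, parameter :: n = 4   ! hypothetical small size, for illustration only
  integer :: i, j

  ! Inner loop over j with i fixed: consecutive iterations are n elements apart.
  print *, 'Inner loop over j (i = 1):'
  i = 1
  do j = 1, n
     print *, '  a(', i, ',', j, ') -> offset', offset(i, j)
  enddo

  ! Inner loop over i with j fixed: consecutive iterations are 1 element apart.
  print *, 'Inner loop over i (j = 1):'
  j = 1
  do i = 1, n
     print *, '  a(', i, ',', j, ') -> offset', offset(i, j)
  enddo

contains

  ! Linear offset (in elements) of a(i,j) in an n-by-n array,
  ! given that Fortran stores arrays column by column.
  integer function offset(i, j)
    integer, intent(in) :: i, j
    offset = (i - 1) + (j - 1) * n
  end function offset

end program stride_check
```

By this reasoning, the offsets in the first listing grow in steps of n and those in the second in steps of 1, so I would expect the loop nest with i innermost to count as stride-1 in the host sense. The accelerator compiler may apply a different criterion, which is what I am asking about.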