In order to parallelize your code, you need to make sure that all iterations of the loops are independent. In your case, ‘vvect’ has two dependencies.
The first is that you are using a computed index, ijk. If the same ijk value occurs on multiple threads, then the final value of vvect(ijk,1) will depend on the order in which the threads run. Since threads can run in any order, your results will be non-deterministic. If you know that all values of ijk are unique, then you can use the “independent” clause to override the compiler’s dependency analysis and have the loop parallelized anyway.
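For example, a sketch of what that would look like (the loop bounds and body here are assumptions based on your post, not your actual code):

!$acc region
!$acc do independent
do n = 1, npts
   ijk = ijkpr(n)
   ! safe only if ijkpr contains no duplicate values
   vvect(ijk,1) = vvect(ijk,1)*rnorm
enddo
!$acc end region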
However, even if all the values of ijkpr are unique, because multiple iterations of the outer loop access the same elements of ijkpr, you will definitely have multiple threads accessing the same elements of vvect. Again, the value of vvect will change from run to run.
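To make that concrete, the hazard looks something like this (a hypothetical reconstruction, with nblmax standing in for your real outer bound):

do nbl = 1, nblmax
   do n = 1, npts
      ijk = ijkpr(n)
      ! every nbl iteration reads the same ijkpr entries, so if the nbl
      ! loop runs in parallel, different threads read and write the
      ! same elements of vvect
      vvect(ijk,1) = vvect(ijk,1)*rnorm
   enddo
enddo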
Now you may think that, due to the associativity of your calculation, the order in which vvect*rnorm is performed doesn’t really matter. The problem is that this is not an atomic operation, so you could get the following scenario:
Thread 1 reads vvect(1,1) from memory
Thread 2 reads vvect(1,1) from memory
Thread 2 calculates vvect(1,1)*rnorm
Thread 2 writes the result back to memory
Thread 1 calculates vvect(1,1)*rnorm
Thread 1 writes the result back to memory

Thread 1’s write overwrites Thread 2’s result, so vvect(1,1) ends up scaled by rnorm once instead of twice; one update is silently lost.
However, all hope is not lost. Assuming you have enough computation, what you can do is manually privatize vvect (i.e. create another array with an extra dimension indexed by the parallel loop variable). You then perform a reduction to populate the original vvect array. Something like the following, where nblmax stands in for your outer loop bound:
!$acc region copy(vvectTMP)
!$acc do kernel
do nbl = 1, nblmax
   ! ... same loop body as before, but accumulating into the
   ! private slice vvectTMP(nbl,...) instead of into vvect ...
enddo
!$acc end region

Then, back on the host, loop over nbl and the elements of vvect and reduce:

vvect(ijk,1) = vvect(ijk,1)+vvectTMP(nbl,ijk,1)
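Putting it together, here is a compilable sketch of the whole pattern using the same directives; the sizes, the ijkpr setup, and the additive update are all hypothetical stand-ins for your real code:

program privatize_vvect
   implicit none
   integer, parameter :: nblmax = 8, npts = 1024
   integer :: n, nbl, ijk
   integer :: ijkpr(npts)
   real :: rnorm, vvect(npts,1), vvectTMP(nblmax,npts,1)

   ! hypothetical setup, with deliberately duplicated indices
   rnorm = 0.5
   do n = 1, npts
      ijkpr(n) = mod(n-1, npts/2) + 1
   enddo
   vvect = 1.0
   vvectTMP = 0.0

!$acc region copy(vvectTMP)
!$acc do kernel
   do nbl = 1, nblmax
      ! each nbl iteration owns its own slice of vvectTMP, so the
      ! duplicate ijk values no longer race across threads
      do n = 1, npts
         ijk = ijkpr(n)
         vvectTMP(nbl,ijk,1) = vvectTMP(nbl,ijk,1) + rnorm
      enddo
   enddo
!$acc end region

   ! sequential reduction on the host populates the original array
   do nbl = 1, nblmax
      do ijk = 1, npts
         vvect(ijk,1) = vvect(ijk,1) + vvectTMP(nbl,ijk,1)
      enddo
   enddo

   print *, 'vvect(1,1) =', vvect(1,1)
end program privatize_vvect

The trade-off is extra memory (vvectTMP is nblmax times the size of vvect) plus the host-side reduction, which is why this only pays off if the parallel region does enough work.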
On a side note, you do not need to use “local(n, nbl, ijk)”. Scalars are privatized (i.e. each thread gets its own copy) by default. “local” is only needed for arrays that are local to the device.
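For instance (scratch and result are made-up names, just to show the distinction):

!$acc region local(scratch) copyout(result)
!$acc do kernel
do n = 1, npts
   ! n is a scalar: each thread automatically gets its own private copy.
   ! scratch is device-only: "local" allocates it on the GPU and never
   ! copies it to or from the host.
   scratch(n) = real(n)*rnorm
   result(n) = scratch(n) + 1.0
enddo
!$acc end region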
Hope this helps,