I have a question about reading and writing memory in an unstructured manner in an accelerator region:
I noticed that if I read the memory of an array (A3_GPU) in unordered fashion inside a loop, then the loop is still parallelizable and the performance is not that bad.
Q1 will have unordered values, e.g.:
i=1 : Q1=2345
i=2 : Q1=12
i=3 : Q1=18474
and so on…
!reading unstructured
!$acc region
do i = 1,100000
Q1 = A1_GPU(KP,1)
A2_GPU(i,3) = A3_GPU(Q1,1)
end do
!$acc end region
But when I want to write to an array (B2_GPU) with an unstructured pattern, then the compiler forces the loop to execute sequentially on the device (!$acc do seq), which gives me very bad performance.
The loop looks like the following, and K1 is unordered, e.g.:
i=1 : K1=2345
i=2 : K1=12
i=3 : K1=18474
!writing unstructured
!$acc region
do i = 1,100000
K1 = B1_GPU(KP,1)
B2_GPU(K1,3) = B3_GPU(i,1)
end do
!$acc end region
Is there any workaround? Or at least a way to tune such a loop?
What does "width" mean if I use the directive !$acc do seq [(width)]?
Copying the data to the host, executing the loop on the CPU, and copying it back to the device is not an option; I guess this would take more time.
Thank you very much!