Unstructured: Reading vs. Writing

elephant · August 17, 2011, 11:53am

I have a question about reading and writing memory in an unstructured manner in an accelerator region:

I noticed that if I read from an array (A3_GPU) in an unordered fashion inside a loop, the loop is still parallelizable and the performance is not that bad.
Q1 takes unordered values, e.g.:

i=1 : Q1=2345
i=2 : Q1=12
i=3 : Q1=18474
and so on…

!reading unstructured
!$acc region
         do i = 1,100000     
             Q1 = A1_GPU(KP,1)
             A2_GPU(i,3) = A3_GPU(Q1,1)             
         end do                   
!$acc end region

But when I want to write to an array (B2_GPU) with an unstructured pattern, then the compiler forces the loop to execute sequentially on the device (!$acc do seq), which gives me very bad performance.
The loop looks like the following, and K1 is unordered, e.g.:

i=1 : K1=2345
i=2 : K1=12
i=3 : K1=18474

!writing unstructured
!$acc region
         do i = 1,100000     
             K1 = B1_GPU(KP,1)
             B2_GPU(K1,3) = B3_GPU(i,1)             
         end do                   
!$acc end region

Is there any workaround? Or just a possibility to tune such a loop?
What does the “width” mean if I use the directive !$acc do seq [(width)]?
Copying the data to the host, executing the loop on the CPU, and copying it back to the device is not an option; I guess this would take more time.

Thank you very much!

MatColgrove · August 17, 2011, 3:08pm

Hi elephant,

But when I want to write to an array (B2_GPU) with an unstructured pattern, then the compiler forces the loop to execute sequentially on the device (!$acc do seq),

For a computed index, the compiler has no way of knowing at compile time whether all the values of K1 are unique. Hence it must assume the worst case, that all values of K1 are the same, in which case every iteration writes to the same element and the loop is not safe to parallelize.
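
To make the hazard concrete, here is a minimal host-only sketch in plain Fortran (no directives). The arrays idx and src and their values are made up for illustration and are not from the original code. Because the index 2 appears twice, the final contents depend on the order in which the iterations run, and a parallel schedule would leave that order unspecified.

```fortran
program scatter_order
  implicit none
  integer, parameter :: n = 4
  integer :: idx(n), src(n), fwd(n), rev(n), i

  idx = [2, 3, 2, 1]        ! hypothetical scatter indices; 2 appears twice
  src = [10, 20, 30, 40]    ! hypothetical source values
  fwd = 0
  rev = 0

  ! Scatter in ascending iteration order: iteration 3 is the last writer to element 2
  do i = 1, n
    fwd(idx(i)) = src(i)
  end do

  ! Same scatter in descending order: iteration 1 is the last writer to element 2
  do i = n, 1, -1
    rev(idx(i)) = src(i)
  end do

  print *, 'ascending : ', fwd
  print *, 'descending: ', rev

  ! Self-check: the duplicate index makes the two results differ
  if (all(fwd == rev)) error stop 'expected the two orders to differ'
end program scatter_order
```

With unique indices the two loops would produce identical arrays, which is exactly the property the compiler cannot prove on its own.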

Is there any workaround?

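Yes, the “independent” clause is your way of asserting to the compiler that the loop iterations are independent of each other, so it’s OK to parallelize. Note that it is only an assertion: the compiler does not check it, so if K1 actually contains repeated values, the result of the loop is undefined.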

!$acc region
!$acc do independent
         do i = 1,100000     
             K1 = B1_GPU(KP,1)
             B2_GPU(K1,3) = B3_GPU(i,1)             
         end do                   
!$acc end region
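
Since “independent” is a promise the compiler does not verify, it can be worth checking the index array once on the host, for example in a debug build. The following is a rough sketch of such a check in plain Fortran. The function name all_unique, the index values, and the upper bound 8 are hypothetical, and the function assumes every index lies in 1..m.

```fortran
program check_unique
  implicit none
  integer :: k(5)

  k = [7, 2, 5, 1, 4]                  ! made-up scatter indices, all distinct
  print *, 'distinct indices unique? ', all_unique(k, 8)

  k(3) = 7                             ! introduce a duplicate of k(1)
  print *, 'repeated indices unique? ', all_unique(k, 8)

contains

  ! Returns .true. if no value occurs twice in idx.
  ! Assumes 1 <= idx(i) <= m for all i.
  logical function all_unique(idx, m)
    integer, intent(in) :: idx(:), m
    logical :: seen(m)
    integer :: i

    all_unique = .true.
    seen = .false.
    do i = 1, size(idx)
      if (seen(idx(i))) then           ! this element would be written twice
        all_unique = .false.
        return
      end if
      seen(idx(i)) = .true.
    end do
  end function all_unique

end program check_unique
```

If the check fails, the scatter loop should stay sequential or be restructured, since asserting independence would then give order-dependent results.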

What does the “width” mean if I use the directive !$acc do seq [(width)]?

It’s the strip size the compiler uses when it strip-mines the loop. Strip mining splits a loop into an outer loop over strips and a small inner loop that works on one strip at a time. This allows variables to be held in cache or, in the GPU case, in shared memory.
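
As an illustration of the transformation itself, here is a hand-written sketch in plain Fortran. The compiler does this internally; the names width, is, a, and b, as well as the sizes, are chosen only for this example. The strip-mined version visits the same iterations as the original loop, just grouped into strips of at most width elements.

```fortran
program strip_mine
  implicit none
  integer, parameter :: n = 10       ! trip count (illustrative)
  integer, parameter :: width = 4    ! strip size (illustrative)
  integer :: a(n), b(n), i, is

  ! Original loop
  do i = 1, n
    a(i) = 2 * i
  end do

  ! Strip-mined loop: the outer loop steps over strips,
  ! the inner loop handles one strip of at most width iterations
  do is = 1, n, width
    do i = is, min(is + width - 1, n)
      b(i) = 2 * i
    end do
  end do

  ! Self-check: both forms compute the same result
  if (any(a /= b)) error stop 'strip-mined result differs'
  print *, 'strip-mined loop matches the original'
end program strip_mine
```

The min() in the inner loop bound handles the last strip when n is not a multiple of width.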

Hope this helps,
Mat

elephant · August 17, 2011, 3:56pm

Very good! Thank you! Can’t wait to implement this “independent” clause and see the performance gain!!!
Excellent…