Efficient Parallelization

Hi,
I have written the following code (skeleton) with OpenACC directives:

        !$acc data copyin(...)
        !$acc data copyout(...)
        !$acc data create(...)
        !$acc parallel
        !$acc loop independent private(...)
        do i = 1, ni
                A(i) = ...
                !$acc loop seq
                do j = 1, nj
                        B(j) = depends on A
                end do
                ...stuff...
                !$acc loop seq
                do k = 1, nk
                        C(k) = depends on B
                end do
        end do
        !$acc end parallel
        !$acc end data
        !$acc end data
        !$acc end data

I feel I am not using the full power of the GPU with the above code.
Do you have any recommendations for parallelizing this efficiently when adjacent inner loops must execute sequentially?
Cheers,
Jyoti

Hi Jyoti,

I’m assuming that B and C are private arrays since the code would have race conditions otherwise.

In this case you can use vector parallelism on the inner loops. There’s an implicit barrier after a vector loop, so you don’t need to worry about the dependencies between the two loops. Also, for gang-private arrays, if the size is known at compile time and the arrays fit, the compiler will place them in shared memory, which can improve performance.
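
For example, here’s a minimal sketch of the structure I mean (the array sizes and the assignments are placeholders, not your actual computation):

        ! Sketch: gang parallelism on the outer loop, vector parallelism on
        ! the inner loops. B and C are gang-private; if their sizes are known
        ! at compile time and they fit, the compiler can place them in shared
        ! memory. The assignments are placeholders only.
        !$acc parallel loop gang private(B, C)
        do i = 1, ni
                A(i) = real(i)                 ! placeholder for A(i)=...
                !$acc loop vector
                do j = 1, nj
                        B(j) = A(i) * j        ! depends on A
                end do
                ! implicit barrier here: B is complete before the next loop reads it
                !$acc loop vector
                do k = 1, nk
                        C(k) = B(1) + k        ! depends on B
                end do
        end do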

Whether you actually want to parallelize the inner loops, I can’t say. It would depend on their trip counts, what’s in “stuff”, and whether anything else would prevent parallelism.

Something else to consider is data movement. Here you have the data regions directly around the compute region, meaning the device data will be created and copied every time this code is executed. Ideally you want to move the data regions higher up in the code and reuse them across multiple compute regions. The goal is to move the data to the device once and selectively bring it back to the host only when necessary.
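
For example, something along these lines, where nsteps, out_every, and write_output are illustrative placeholders rather than anything from your code:

        ! Sketch: one enclosing data region spanning many compute regions.
        ! Data moves to the device once; "update self" refreshes the host
        ! copy of A only when output is actually needed.
        !$acc data copyin(x) copy(A)
        do step = 1, nsteps
                !$acc parallel loop
                do i = 1, ni
                        A(i) = A(i) + x(i)           ! placeholder computation
                end do
                if (mod(step, out_every) == 0) then
                        !$acc update self(A)         ! host copy only for I/O
                        call write_output(A, step)   ! hypothetical I/O routine
                end if
        end do
        !$acc end data

Note that “update self” refreshes the host copy without destroying the device data, so the loop can keep running on the GPU between outputs.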

If you want to provide a minimal reproducing example that I can build and execute, I’ll be better able to give recommendations.

-Mat

Hi Mat,

Thanks again for the detailed response; there is a lot of useful information in there. What I found most useful is that “There’s an implicit barrier after a vector loop, so you don’t need to worry about the dependencies between the two loops”.

Data movement is constantly in the back of my mind as I port the code to OpenACC. However, it is not my primary goal right now. My aim at the moment is to get the code working correctly on a GPU. Once that is achieved, I plan to optimize the data transfers and get the most out of the GPUs.

At this point I am not sharing a reproducible example because your response solved the issue. :) Once I get to optimizing data transfers, I might write one, since pseudo code might not suffice then.

Cheers,
Jyoti

Hi Mat,

I just wanted to thank you for your invaluable help. Depending on the problem size, the OpenACC-based GPU code now runs anywhere from 2x to 10x faster than my MPI-based CPU code. Additionally, power consumption has been cut in half!
Porting the code would have taken much longer without your help, so thank you!

Cheers,
Jyoti

