# Task Parallelism using Accelerator directives

Hi,
I have two independent tasks in my code, as in the example below.
How can I instruct the compiler to execute them in parallel, and how can
I force a synchronization at the end of the two tasks?
Example:

``````
! sum of two matrices A and B, task 1
do i = 1, 512
  do j = 1, 512
    C(i,j) = A(i,j) + B(i,j)
  enddo
enddo
!
! multiplication of two matrices A and B, task 2
do i = 1, 512
  do j = 1, 512
    D(i,j) = 0
  enddo
enddo
do i = 1, 512
  do j = 1, 512
    do k = 1, 512
      D(i,j) = D(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
! Need synchronization here
! sum of matrices D and C, the results of the preceding tasks
do i = 1, 512
  do j = 1, 512
    E(i,j) = D(i,j) + C(i,j)
  enddo
enddo
``````

Hi Fedele.Stabile,

Use the “async” clause and then add a “wait” directive to synchronise.

Something along the lines of:

``````
integer(4) :: handle

handle = 1

!$acc data region copyin(A,B), copyout(E), local(C,D)

! sum of two matrices A and B, task 1
!$acc region async(handle)
do i = 1, 512
  do j = 1, 512
    C(i,j) = A(i,j) + B(i,j)
  enddo
enddo
!$acc end region
!
! multiplication of two matrices A and B, task 2
!$acc region async(handle)
do i = 1, 512
  do j = 1, 512
    D(i,j) = 0
  enddo
enddo
do i = 1, 512
  do j = 1, 512
    do k = 1, 512
      D(i,j) = D(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
!$acc end region

!$acc wait(handle)
! Synchronization happens here
! sum of matrices D and C, the results of the preceding tasks
!$acc region
do i = 1, 512
  do j = 1, 512
    E(i,j) = D(i,j) + C(i,j)
  enddo
enddo
!$acc end region
!$acc end data region
``````

Note that the use of a “handle” is optional.
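If you drop the handle, the skeleton looks like this: a bare “async” clause on each region and an unqualified “wait”, which blocks until all outstanding asynchronous work has completed. This is just a directive sketch (the loop bodies are the same as above, elided here):

``````
!$acc region async      ! task 1, launched without blocking the host
! ... C = A + B loops ...
!$acc end region

!$acc region async      ! task 2
! ... D = A x B loops ...
!$acc end region

!$acc wait              ! block until all outstanding async work finishes
``````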

• Mat

Hi Mat,

Would it be possible for you to add an asynchronous data transfer into this example?

I have tried the following code but the CUDA_PROFILE output indicates that the transfer of F occurs on the same streamid as the transfers of the other arrays rather than the same stream as the kernel executions.

The method used for the data transfer also appears to be the blocking “memcpyHtoD” rather than the asynchronous version I would have anticipated.
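For reference, the kind of transfer I was expecting under the hood looks like the following CUDA C sketch (names and sizes are illustrative, not what the compiler actually generates). Note that a genuinely asynchronous host-to-device copy needs page-locked host memory and a non-default stream; with pageable memory the runtime falls back to a blocking copy:

``````
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 512 * 512 * sizeof(int);
    int *h_F;            /* host buffer (pinned) */
    int *d_F;            /* device buffer */
    cudaStream_t stream;

    /* Page-locked (pinned) host memory: without it, cudaMemcpyAsync
       silently degrades to a blocking copy. */
    cudaMallocHost((void**)&h_F, bytes);
    cudaMalloc((void**)&d_F, bytes);
    cudaStreamCreate(&stream);

    /* Non-blocking host-to-device transfer on a non-default stream,
       so it can overlap with kernels launched on other streams. */
    cudaMemcpyAsync(d_F, h_F, bytes, cudaMemcpyHostToDevice, stream);

    /* ... launch independent kernels here ... */

    cudaStreamSynchronize(stream);   /* wait for the copy to finish */

    cudaStreamDestroy(stream);
    cudaFree(d_F);
    cudaFreeHost(h_F);
    return 0;
}
``````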

BTW there seems to be an issue when compiling Accelerator code that uses the async clause and the -Mcuda flag. The profiling output indicates that all kernels and transfers are executed on streamid 0 in this case.

Karl

``````
program asynctest
  integer, dimension(512,512) :: A, B, C, D, E, F
!$acc mirror(F)
  integer(4) :: handle

  handle = 1

!$acc data region copyin(A,B), copyout(E), local(C,D)

!$acc update device(F) async(handle)

! sum of two matrices A and B, task 1
!$acc region async(handle)
!$acc do
  do i = 1, 512
    do j = 1, 512
      C(i,j) = A(i,j) + B(i,j)
    enddo
  enddo
!$acc end region

! multiplication of two matrices A and B, task 2
!$acc region async(handle)
!$acc do
  do i = 1, 512
    do j = 1, 512
      D(i,j) = 0
    enddo
  enddo
  do i = 1, 512
    do j = 1, 512
      do k = 1, 512
        D(i,j) = D(i,j) + A(i,k)*B(k,j)
      enddo
    enddo
  enddo
!$acc end region

!$acc wait(handle)
! Synchronization happens here

! sum of matrices D and C (results of the preceding tasks)
!$acc region
!$acc do
  do i = 1, 512
    do j = 1, 512
      E(i,j) = D(i,j) + C(i,j)
    enddo
  enddo
!$acc end region
!$acc end data region
end program asynctest
``````
