Asynchronous call from CPU

I understand that with CUDA kernels, control returns to the CPU immediately after the kernel is launched, so I would expect to be able to do other work on the CPU while the kernel runs on the GPU.

Is this not true for OpenACC? I tried timing a call, something like this:

    !$ t1 = omp_get_wtime()
    call acc_compute()
    !$ t2 = omp_get_wtime()
    write(*, *) t2 - t1

The subroutine contains OpenACC loops. If control returned to the CPU immediately, t2-t1 would report only the launch overhead, but the time increases as I make the loops bigger.

Are you using the “async” clause on your compute kernels? If not, add “async” with a queue number (like “async(1)”) to make the device computation asynchronous to the host. You can then use the “wait” directive to set the synchronization point.

  • Mat

Hi Mat,

Thanks for the answer. So, are you suggesting this?

    !$ t1 = omp_get_wtime()
    call acc_compute()
    !$ t2 = omp_get_wtime()

    write(*, *) t2 - t1
    !$acc wait(1)

    subroutine acc_compute()

    !$acc parallel loop async(1)
    do i = 1, N
      ! Some computations
    end do
    !$acc end parallel loop

    end subroutine acc_compute

Correct.

Chapter 7 of the OpenACC best practices guide may be helpful since it gives a fuller explanation and examples. http://www.openacc.org/sites/default/files/OpenACC_Programming_Guide_0.pdf
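
The pattern above can be tried in a small standalone program. This is only a sketch: the program name async_timing, the array x and the size n are made up for illustration, and system_clock is used instead of omp_get_wtime so that it builds without OpenMP. It assumes an OpenACC compiler; without one the directives are treated as comments and the two timings will be nearly identical.

```fortran
! Sketch only: async_timing, x and n are illustrative names, not from the thread.
program async_timing
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 10000000
  real(real64), allocatable :: x(:)
  integer(int64) :: c0, c1, c2, rate
  integer :: i

  allocate(x(n))
  call system_clock(c0, rate)

  ! Queue the loop (and the copy of x back to the host) on async queue 1.
  ! The host continues without waiting for the device.
  !$acc parallel loop async(1) copyout(x)
  do i = 1, n
    x(i) = 2.0_real64 * real(i, real64)
  end do
  !$acc end parallel loop

  ! Host reaches this point while the device may still be busy.
  call system_clock(c1)

  ! Block until everything queued on async queue 1 has finished.
  !$acc wait(1)
  call system_clock(c2)

  print *, 'launch only   (s):', real(c1 - c0, real64) / real(rate, real64)
  print *, 'launch + wait (s):', real(c2 - c0, real64) / real(rate, real64)
  print *, 'x(n) =', x(n)
end program async_timing
```

Any host work placed between the loop and the wait directive can overlap with the device computation. The contents of x should only be read on the host after the wait.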

Thanks. I have another question.

How would the following work? I have a big loop that contains another loop followed by a DGEMM call. How should I go about it if I want each outer loop iteration to run asynchronously? I ask because I don’t see how the DGEMM calls can be placed on the same stream as the OpenACC loop.

do i = 1, N
  !$acc parallel loop async(i)
  do j = 1, M
    Fd(i*M + j) = something
  end do
  !$acc end parallel loop
  
  !$acc host_data use_device(Lf, Fd)
  call DGEMM(Lf, Fd)
  !$acc end host_data

end do

!$acc wait

end subroutine acc_compute

Hi vsingh96824,

You need to get the cuBLAS handle and the CUDA stream associated with the OpenACC async queue, then call cublasSetStream so that the two use the same stream.

For example:

#ifdef _OPENACC
type(cublasHandle) :: h
h = cublasGetHandle()
#endif
...
!$acc update device(a,b,c) async(3)
#ifdef _OPENACC
istat = cublasSetStream(h,acc_get_cuda_stream(3))
#endif
!$acc host_data use_device(a,b,c)
call dgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
!$acc end host_data

  • Mat

Thanks Mat.