I understand that with CUDA kernels, control returns to the CPU immediately after you launch the kernel. So I would think you could do other work on the CPU while the kernel is running on the GPU.
Is this not true for OpenACC? I tried timing a call, something like this:
!$ t1 = omp_get_wtime()
call acc_compute()
!$ t2 = omp_get_wtime()
write(*, *) t2-t1
The subroutine contains OpenACC loops. If control returned to the CPU immediately, t2-t1 would report only the launch overhead, but I see the time increase as I make the loops bigger.
Are you using the “async” clause on your compute kernels? If not, add “async” with a queue number (like “async(1)”) to make the device computation asynchronous to the host. You can then use the “wait” directive to set the synchronization point.
Hi Mat,
Thanks for the answer. So, are you suggesting this?
!$ t1 = omp_get_wtime()
call acc_compute()
!$ t2 = omp_get_wtime()
write(*, *) t2-t1
!$acc wait(1)
subroutine acc_compute()
!$acc parallel loop async(1)
do i = 1, N
!Some computations
end do
!$acc end parallel loop
end subroutine acc_compute
Correct.
Chapter 7 of the OpenACC best practices guide may be helpful since it gives a fuller explanation and examples. http://www.openacc.org/sites/default/files/OpenACC_Programming_Guide_0.pdf
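To make the pattern confirmed above concrete, here is a minimal, self-contained sketch of it. It is an illustration under stated assumptions rather than the original poster's code: the program name, the array a, and the size n are hypothetical, and the standard system_clock intrinsic is swapped in for omp_get_wtime so that no OpenMP flag is needed. If the code is built without OpenACC enabled, the directives are ignored and both timings will include the loop.

```fortran
program async_timing
  implicit none
  integer, parameter :: n = 10000000      ! hypothetical problem size
  real(8), allocatable :: a(:)            ! hypothetical work array
  integer(8) :: c0, c1, c2, rate
  integer :: i

  allocate(a(n))
  call system_clock(count_rate=rate)

  !$acc data create(a)
  call system_clock(c0)
  ! Launch on async queue 1; the host does not wait for the loop to finish.
  !$acc parallel loop async(1)
  do i = 1, n
    a(i) = sqrt(real(i, 8))
  end do
  call system_clock(c1)   ! reached right after the launch when async is active
  ! Independent host work could go here.
  !$acc wait(1)
  call system_clock(c2)   ! reached only after queue 1 has finished
  !$acc update self(a)
  !$acc end data

  write(*, *) 'launch only      :', real(c1 - c0, 8) / real(rate, 8)
  write(*, *) 'launch + compute :', real(c2 - c0, 8) / real(rate, 8)
  write(*, *) 'a(n) =', a(n)
end program async_timing
```

With async in effect, the first time should stay roughly constant as n grows, while the second should grow with the loop size.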
Thanks. I had yet another question.
How would the following work? I have a big loop that contains another loop followed by a DGEMM call. How would I go about it if I want each outer-loop iteration to run asynchronously? I ask because I don’t see how the DGEMM calls can be put on the same stream as the loop.
do i = 1, N
!$acc parallel loop async(i)
do j = 1, M
Fd(i*M + j) = something
end do
!$acc end parallel loop
!$acc host_data use_device(Lf, Fd)
call DGEMM(Lf, Fd)
!$acc end host_data
end do
!$acc wait
end subroutine acc_compute
Hi vsingh96824,
You need to get the cuBLAS handle, get the CUDA stream associated with the OpenACC async queue, and then call cublasSetStream so that the two use the same stream.
For example:
#ifdef _OPENACC
type(cublasHandle) :: h
h = cublasGetHandle()
#endif
...
!$acc update device(a,b,c) async(3)
#ifdef _OPENACC
istat = cublasSetStream(h,acc_get_cuda_stream(3))
#endif
!$acc host_data use_device(a,b,c)
call dgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
!$acc end host_data