"acc wait" behavior

Hi Mat!

Could you please explain the behavior of the “acc wait” directive?

For example, in my code I have

subroutine foo(...)
curs = ...
!$acc kernels async(curs)
 do i=....
...
end subroutine foo

...

call foo(a,b,c)
call foo(d,e,f)
call foo(g,h,i)
...
call foo(x,y,z)

!$acc wait

curs takes a limited number of values (9 in my case).

In nvvp I see 9 invocations of cuStreamSynchronize. PGI_ACC_DEBUG=1 shows that the “acc wait” directive resulted in 9 invocations of the acc_wait routine, each with a different stream value.

Does it mean that if I use 10 threads somewhere at the beginning of my program and only one after that, each “acc wait” directive will check every CUDA stream I have ever used? I guess that could cause a performance problem. Why not call cudaThreadSynchronize() instead?

In my case, once all kernels have completed, waiting for a single stream takes 1.9 us, while waiting for all threads takes 35 us.

Alexey

Hi Alexey,

Let me ask Michael since I’m not positive on the implementation details here.

- Mat

A call to acc_wait, or a bare !$acc wait or #pragma acc wait directive, does indeed wait on all the open queues on the device, calling cuStreamSynchronize for each one. The runtime uses the Driver API, so the Driver API counterpart of the routine you suggest would be cuCtxSynchronize. We will have to experiment with this to see whether it imposes any additional synchronization. Thanks for the idea.
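
In the meantime, if you know which queue a piece of work was launched on, the wait directive also accepts a queue number, so only that queue is synchronized instead of every open one. Below is a minimal sketch in C, not the code from this thread: scale_async and run_two_queues are hypothetical names, and queue numbers 1 and 2 are arbitrary stand-ins for values of curs. Whether this is cheaper in your case would need to be measured.

```c
/* Hypothetical stand-in for foo(): scales x in place on async queue q.
 * The copy clause moves x to the device and back on the same queue. */
static void scale_async(float *x, int n, float factor, int q)
{
    #pragma acc parallel loop async(q) copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= factor;
}

/* Launches work on two queues, then waits on each queue by number
 * rather than issuing a bare "wait" over all queues.
 * Returns a[0] + b[0] after both results are back on the host. */
static float run_two_queues(float *a, float *b, int n)
{
    scale_async(a, n, 3.0f, 1);   /* queue 1 (assumed value of curs) */
    scale_async(b, n, 5.0f, 2);   /* queue 2 (assumed value of curs) */

    /* Synchronize queue 1 only; queue 2 may still be running. */
    #pragma acc wait(1)
    float first = a[0];

    /* Now synchronize queue 2 before reading b on the host. */
    #pragma acc wait(2)
    return first + b[0];
}
```

Compiled without OpenACC support, the pragmas are ignored and the loops simply run on the host, so the sketch can be checked for correctness on any C compiler. The Fortran equivalent of the directive is !$acc wait(curs).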