Could you please explain the behavior of the “acc wait” directive?
For example, in my code I have:
    subroutine foo(...)
      curs = ...
      !$acc kernels async(curs)
      do i=....
      ...
    end subroutine foo
    ...
    call foo(a,b,c)
    call foo(d,e,f)
    call foo(g,h,i)
    ...
    call foo(x,y,z)
    !$acc wait
curs takes a limited number of values (9 in my case).
In nvvp I see 9 invocations of cuStreamSynchronize. PGI_ACC_DEBUG=1 shows that the “acc wait” directive resulted in 9 invocations of the acc_wait routine, each with a different stream value.
Does this mean that if I use 10 threads somewhere at the beginning of my program and only one after that, each “acc wait” directive will check every CUDA stream I have ever used? I guess that could become a performance problem. Why not just call cudaThreadSynchronize() instead?
In my case, if all kernels have already completed, waiting for a single stream takes 1.9 us, while waiting for all threads takes 35 us.
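
To make the pattern concrete, here is a minimal, self-contained sketch of what I am describing. It is not my actual code: the program name, array names, sizes, and the two queue numbers are made up for illustration. As far as I understand the OpenACC specification, the wait directive also accepts a queue argument, so the sketch contrasts waiting on one queue with the bare “acc wait” that waits on all of them.

```fortran
program wait_sketch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  integer :: i, q
  real :: a(n), b(n)

  a = 0.0
  b = 0.0

  ! First kernel goes to async queue 1 (illustrative value of curs)
  q = 1
  !$acc kernels async(q) copy(a)
  do i = 1, n
     a(i) = a(i) + 1.0
  end do
  !$acc end kernels

  ! Second kernel goes to async queue 2
  q = 2
  !$acc kernels async(q) copy(b)
  do i = 1, n
     b(i) = b(i) + 2.0
  end do
  !$acc end kernels

  ! Wait only for the work queued on async queue 1
  !$acc wait(1)

  ! Bare wait: waits for every async queue, which is the case
  ! that produced one stream synchronization per queue in my run
  !$acc wait

  print *, a(1), b(1)
end program wait_sketch
```

Without an OpenACC compiler the directives are treated as comments and the loops simply run on the host, so the sketch still compiles as plain Fortran.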