In my Fortran code, if I use only OpenACC, with “mirror” to share global device arrays across several subroutines, any host-to-device updates done with “update” WITHOUT “async” generate only one stream.
However, when I pass the flag “-Mcuda” to the compiler and do not use “update”, there is still one stream. But when I use “update” (even without “async”), two streams are launched, and one of them is responsible for the memory transfers.
Why is that? How can I force a single stream while still using -Mcuda?
P.S.: In my application, this asynchronous memory transfer cannot reduce the run time, since no computations and memcpys are overlapped. Moreover, the two streams introduce overhead that is not negligible compared with the pure computation.
-----------------------------updated with an example here---------------------
Below is a Fortran 77 sample code called simpleTest.f. My compile command line is
pgf90 -o test -acc -mp -ta=nvidia:cc2.0,time -Minfo=accel -Mcuda -Mvect simpleTest.f
      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      common/blk/t1(N),t2(N),t3(N)
!$acc update device(t1,t2,t3)
!$acc kernels
!$acc loop
      do i=1,N
        t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
      enddo
!$acc end kernels
!$acc update host(t2)
      return
      end

      program mainTest
      integer N,i
      parameter (N=1048576)
      real t1(1:N),t2(1:N),t3(1:N)
      common/blk/t1,t2,t3
!$acc mirror(t1,t2,t3)
      do i=1,N
        CALL RANDOM_NUMBER(HARVEST=X)
        t1(i)=X
        CALL RANDOM_NUMBER(HARVEST=X)
        t2(i)=X
        CALL RANDOM_NUMBER(HARVEST=X)
        t3(i)=X
      enddo
      do j=1,10
        call mysub()
      end do
      end program mainTest
When you run the executable “test” under nvvp, you will see two streams.
However, if I change
It uses one stream again.