Implicit async memcpy (updated with an example)

Hi,

In my Fortran code, if I use only OpenACC, with “mirror” to share global device arrays across several subroutines, then any host-to-device transfer done with “update” WITHOUT “async” generates only one stream.

However, when I pass the flag “-Mcuda” to the compiler and do not use “update”, there is still one stream. But when I use “update” (even without “async”), two streams are launched, and one of them is responsible for the memory transfers.

Why is that? How can I force a single stream while still using -Mcuda?

P.S.: In my application, this async memory transfer cannot reduce the run time, since no computations and memcpys overlap. Moreover, the two streams add some overhead, which is not negligible compared with the pure computation.



-----------------------------updated with an example here---------------------
Below is a Fortran 77 sample code called simpleTest.f. My compile command line is

pgf90 -o test -acc -mp -ta=nvidia:cc2.0,time -Minfo=accel -Mcuda -Mvect simpleTest.f



      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      common/blk/t1(N),t2(N),t3(N)
!$acc update device(t1,t2,t3)
!$acc kernels
!$acc loop
      do i=1,N
         t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
      enddo
!$acc end kernels
!$acc update host(t2)
      return
      end

      program mainTest
      integer N,i
      parameter (N=1048576)
      real t1(1:N),t2(1:N),t3(1:N)
      common/blk/t1,t2,t3

!$acc mirror(t1,t2,t3)

      do i=1,N
         CALL RANDOM_NUMBER(HARVEST=X)
         t1(i)=X 
         CALL RANDOM_NUMBER(HARVEST=X)
         t2(i)=X 
         CALL RANDOM_NUMBER(HARVEST=X)
         t3(i)=X 
      enddo
  

      do j=1,10
         call mysub()
      end do

      end program mainTest



When you run the executable “test” under nvvp, you will see two streams in the profiling timeline.

However, if I change the assignment in the loop to use t3(i) instead of t3(N):

t2(i)=t1(i)*t1(i)+t2(i)*t3(i)

It uses one stream again.
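
To put a number on the overhead described in the P.S., one option is to time the calls directly. The following is only a sketch, assuming the same common block and directives as simpleTest.f. The program name timeTest, the variables c0, c1 and rate, and the untimed warm-up call are illustrative additions, not part of the original sample. It relies only on the standard system_clock intrinsic.

```fortran
! Timing sketch (illustrative, not from the original sample):
! wraps the ten calls from simpleTest.f in system_clock.
      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      real t1(N),t2(N),t3(N)
      common/blk/t1,t2,t3
!$acc update device(t1,t2,t3)
!$acc kernels
!$acc loop
      do i=1,N
! t3(N) is the variant reported to show two streams;
! replace it with t3(i) to time the one-stream variant
         t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
      enddo
!$acc end kernels
!$acc update host(t2)
      return
      end

      program timeTest
      integer N,j
      parameter (N=1048576)
      real t1(N),t2(N),t3(N)
      common/blk/t1,t2,t3
! c0, c1, rate are hypothetical names for the clock readings
      integer c0,c1,rate
!$acc mirror(t1,t2,t3)

      call random_number(t1)
      call random_number(t2)
      call random_number(t3)

! assumption: one untimed call absorbs device start-up cost
      call mysub()

      call system_clock(c0,rate)
      do j=1,10
         call mysub()
      end do
      call system_clock(c1)

      print *,'seconds per call:',real(c1-c0)/real(rate)/10.0
      end program timeTest
```

Building this once with and once without -Mcuda (other flags as in the command line above) and comparing the printed times would indicate whether the second stream costs anything measurable in this case.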





thanks!

Hi pilot117,

Sorry for the delayed response. I needed to ask one of our compiler engineers about this. It turns out to be the same problem as one we found internally a few weeks ago with stream assignment and asynchronous data movement. The correct behaviour here is to use a single stream. We will have this corrected in our next release.

Thanks!
Mat

Hi, Mat,

I tried this simple example with the 12.6 compiler. It still runs two streams as before, and the profiling timeline is similar to the one shown above.

Any idea?

thanks!