Implicit async memcpy (updated with an example)

pilot117 · July 10, 2012, 5:10pm

Hi,

In my Fortran code, if I use only OpenACC, with the “mirror” directive to share global device arrays across several subroutines, then host-to-device updates done with “update” WITHOUT “async” generate only one stream.

However, when I pass the “-Mcuda” flag to the compiler, there is still one stream as long as I do not use “update”. But when I use “update” (even without “async”), two streams are launched, and one of them is responsible for the memory transfers.

Why is that? How can I force a single stream while still using -Mcuda?

P.S.: In my application, this asynchronous memory transfer cannot reduce the run time, since no computations and memcpys overlap. Moreover, the two streams add overhead that is not negligible compared with the pure computation.

-----------------------------updated with an example here---------------------
Below is a Fortran 77 sample code called simpleTest.f. My compile command line is

pgf90 -o test -acc -mp -ta=nvidia:cc2.0,time -Minfo=accel -Mcuda -Mvect simpleTest.f

      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      common/blk/t1(N),t2(N),t3(N)
!$acc update device(t1,t2,t3)
!$acc kernels
!$acc loop
      do i=1,N
         t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
      enddo
!$acc end kernels
!$acc update host(t2)
      return
      end

      program mainTest
      integer N,i
      parameter (N=1048576)
      real t1(1:N),t2(1:N),t3(1:N)
      common/blk/t1,t2,t3

!$acc mirror(t1,t2,t3)

      do i=1,N
         CALL RANDOM_NUMBER(HARVEST=X)
         t1(i)=X 
         CALL RANDOM_NUMBER(HARVEST=X)
         t2(i)=X 
         CALL RANDOM_NUMBER(HARVEST=X)
         t3(i)=X 
      enddo
  

      do j=1,10
      call mysub()
      end do 

      end program mainTest

When you run the executable “test” under nvvp, you will see two streams in the timeline.

However, if I change the loop body to

t2(i)=t1(i)*t1(i)+t2(i)*t3(i)

It uses one stream again.
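
To make the two cases easier to compare, here is a minimal sketch of the same test with both loop bodies kept in one source file and selected by a preprocessor macro. The macro name T3_BY_INDEX is hypothetical (it is not part of the original code), and the sketch assumes the file is run through the C preprocessor before compilation. It has not been profiled, so it only restates the behaviour reported above; it does not confirm it.

```fortran
!     Build twice: once as-is, once with T3_BY_INDEX defined.
!     T3_BY_INDEX is a hypothetical macro name, not from the post.
      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      real t1,t2,t3
      common/blk/t1(N),t2(N),t3(N)
!$acc update device(t1,t2,t3)
!$acc kernels
!$acc loop
      do i=1,N
#ifdef T3_BY_INDEX
!        Variant reported above to run on one stream.
         t2(i)=t1(i)*t1(i)+t2(i)*t3(i)
#else
!        Variant reported above to launch a second stream.
         t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
#endif
      enddo
!$acc end kernels
!$acc update host(t2)
      return
      end

      program mainTest
      integer N,i,j
      parameter (N=1048576)
      real X
      real t1(1:N),t2(1:N),t3(1:N)
      common/blk/t1,t2,t3
!$acc mirror(t1,t2,t3)
!     Fill the host arrays with random values.
      do i=1,N
         call random_number(harvest=X)
         t1(i)=X
         call random_number(harvest=X)
         t2(i)=X
         call random_number(harvest=X)
         t3(i)=X
      enddo
!     Repeat the call so the transfers are visible in the profile.
      do j=1,10
         call mysub()
      enddo
      end program mainTest
```

Profiling the two builds with the same compiler flags as above should show whether the extra stream depends only on the t3(N) versus t3(i) access.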

thanks!

MatColgrove · July 12, 2012, 8:32pm

Hi pilot117,

Sorry for the delayed response. I needed to ask one of our compiler engineers about this. It turns out to be the same problem as one we found internally a few weeks ago with stream assignment and asynchronous data movement. The correct behaviour here is to use a single stream. We will have this corrected in our next release.

Thanks!
Mat

pilot117 · August 8, 2012, 10:54pm

Hi, Mat,

I tried this simple example with the 12.6 compiler. It still runs two streams as before, and the profiling timeline is similar to the earlier one.

Any idea?

thanks!
