Marking some arrays as pinned

Hello,

For various reasons I want to register several arrays in my code as pinned memory. I’d like to mix CUDA Fortran and OpenACC.

In the simplest case when I have

       real (8), dimension(:,:), allocatable :: a, b, c
       integer n,m

       n = 2000
       m = 2000

       allocate( a(n,m) )
       allocate( b(n,m) )
       allocate( c(n,m) )

      do i = 1,n
         do j = 1,m
            a(i,j) = 100*i + j
            b(i,j) = dsin(i*j*3.1415d0)
         enddo
      enddo


!$acc data create(a,b,c)
!$acc update device(a,b) async
...

and the compiler options

pgfortran -acc -ta=nvidia,cc35,pin -Mcuda -Minfo=accel test_ord.F -o test_ord

everything works fine, and I see that the ‘a’ and ‘b’ arrays are transferred to the GPU in two data-movement operations, without being copied to an internal buffer first.

If I modify the code as follows

       use cudafor
       real (8), dimension(:,:), allocatable :: a, b, c
       integer n,m,ierr,f

       n = 2000
       m = 2000

       ierr = cudaSetDeviceFlags(cudaDeviceMapHost)
       allocate( a(n,m) )
       allocate( b(n,m) )
       allocate( c(n,m) )

       ierr = cudaHostRegister(C_LOC(a),n*m*8,cudaHostRegisterPortable)
       ierr = cudaHostRegister(C_LOC(b),n*m*8,cudaHostRegisterPortable)
       ierr = cudaHostRegister(C_LOC(c),n*m*8,cudaHostRegisterPortable)

      do i = 1,n
         do j = 1,m
            a(i,j) = 100*i + j
            b(i,j) = dsin(i*j*3.1415d0)
         enddo
      enddo


!$acc data create(a,b,c)
!$acc update device(a,b) async

and use the command line

pgfortran -acc -ta=nvidia,cc35 -Mcuda -Minfo=accel test_ord_cuda.F -o test_ord_cuda

I see each array transferred in two or three operations (depending on PGI_ACC_BUFFERSIZE).

Why doesn’t cudaHostRegister() turn the array memory into pinned memory?
What am I doing wrong?

Hi Alexey,

While the CUDA “cudaHostRegister” run time call does pin the memory, that information isn’t propagated to the PGI OpenACC run time. Hence, the OpenACC runtime goes ahead and uses its internal buffers.

Instead of calling “cudaHostRegister”, can you try adding the CUDA Fortran “pinned” attribute to the declarations of “a”, “b”, and “c”? This would allow the PGI run time to handle the memory pinning.
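
For example, the declarations above might look something like this (a minimal sketch; “got_pinned” is only an illustrative name, and the optional “pinned=” specifier on the allocate statement reports whether page-locking actually succeeded):

       use cudafor
       real (8), dimension(:,:), allocatable, pinned :: a, b, c
       logical :: got_pinned
       integer n,m

       n = 2000
       m = 2000

! arrays with the "pinned" attribute are allocated directly in
! page-locked host memory, so the OpenACC runtime can skip its buffers
       allocate( a(n,m), pinned=got_pinned )
       allocate( b(n,m) )
       allocate( c(n,m) )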

  • Mat

Hi Mat,

Thanks for your answers.

Yes, it works with the “pinned” attribute on the array declarations. The initial idea was not to rewrite a lot of code…

Why not check the memory attributes right before the H2D/D2H operation at run time?

Alexey

Mat,

one more question.
Did I get it right that “acc update … async” makes the compiler insert instructions to copy data from my array to an internal buffer (pinned memory) and then call cuMemcpyAsync? So only the final part is performed asynchronously, while the copy from my array to the internal buffer is performed in the same CPU thread.

In my code I have

!$acc update host(....) async(123)
 call foo() ! I have several kernels here
!$acc wait(123)

NVVP shows that there is mostly no overlap of the GPU-to-CPU transfer with computation; only the last chunk of data (one of dozens) is transferred at the same time the first kernel from foo() is running.

Hi Alexey,

Why not check the memory attributes right before the H2D/D2H operation at run time?

Doing so would add run-time overhead to every copy. Given that this is a very limited use case, and one with recommended alternative solutions, I don’t think we could justify the added overhead.

Yes, it works with the “pinned” attribute on the array declarations. The initial idea was not to rewrite a lot of code…

Good. Using the “pin” flag or the “pinned” attribute should help reduce the amount of code that needs to be rewritten.
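
For example, the “pin” sub-option from your first command line could also be applied to the second test (assuming the same file name):

pgfortran -acc -ta=nvidia,cc35,pin -Mcuda -Minfo=accel test_ord_cuda.F -o test_ord_cuda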

Did I get it right that “acc update … async” makes the compiler insert instructions to copy data from my array to an internal buffer (pinned memory) and then call cuMemcpyAsync?

“async” in this context means that the data copy is performed asynchronously with respect to the host. When copying to the device, the host continues while a helper CPU thread manages the data movement and does use cuMemcpyAsync.
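
As a minimal sketch of that device-bound case (program and variable names here are only illustrative):

       program async_h2d
       real (8), dimension(:,:), allocatable :: a
       integer n,m
       n = 2000
       m = 2000
       allocate( a(n,m) )
!$acc data create(a)
! the host thread continues past this update; a helper CPU thread
! manages the data movement and issues cuMemcpyAsync
!$acc update device(a) async(1)
! ... other host work could overlap the H2D transfer here ...
!$acc wait(1)
!$acc end data
       end program async_h2d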

Unfortunately, due to technical reasons, copies to the host are not performed asynchronously. We have tried multiple implementations, but none worked to our satisfaction. The only reliable method we found was to have the host wait until the copy completed. This is most likely what you’re seeing in NVVP.

  • Mat