Marking some arrays as pinned

Hello,

For various reasons I want to register several arrays in my code as pinned memory. I’d like to mix CUDA Fortran and OpenACC.

In the simplest case when I have

       real (8), dimension(:,:), allocatable :: a, b, c
       integer n,m

       n = 2000
       m = 2000

       allocate( a(n,m) )
       allocate( b(n,m) )
       allocate( c(n,m) )

      do i = 1,n
         do j = 1,m
            a(i,j) = 100*i + j
            b(i,j) = dsin(i*j*3.1415d0)
         enddo
      enddo


!$acc data create(a,b,c)
!$acc update device(a,b) async
...

and the compiler options

pgfortran -acc -ta=nvidia,cc35,pin -Mcuda -Minfo=accel test_ord.F -o test_ord

everything works fine, and I see that the ‘a’ and ‘b’ arrays are transferred to the GPU in two data-movement operations, without being copied to an internal buffer first.

If I modify the code as follows

       use cudafor
       real (8), dimension(:,:), allocatable :: a, b, c
       integer n,m,ierr,f

       n = 2000
       m = 2000

       ierr = cudaSetDeviceFlags(cudaDeviceMapHost)
       allocate( a(n,m) )
       allocate( b(n,m) )
       allocate( c(n,m) )

       ierr = cudaHostRegister(C_LOC(a),n*m*8,cudaHostRegisterPortable)
       ierr = cudaHostRegister(C_LOC(b),n*m*8,cudaHostRegisterPortable)
       ierr = cudaHostRegister(C_LOC(c),n*m*8,cudaHostRegisterPortable)

      do i = 1,n
         do j = 1,m
            a(i,j) = 100*i + j
            b(i,j) = dsin(i*j*3.1415d0)
         enddo
      enddo


!$acc data create(a,b,c)
!$acc update device(a,b) async

and use the command line

pgfortran -acc -ta=nvidia,cc35 -Mcuda -Minfo=accel test_ord_cuda.F -o test_ord_cuda

I see each array transferred in two or three operations (depending on PGI_ACC_BUFFERSIZE).

Why doesn’t cudaHostRegister() turn the array memory into pinned memory?
What am I doing wrong?

Hi Alexey,

While the CUDA “cudaHostRegister” run time call does pin the memory, that information isn’t propagated to the PGI OpenACC run time. Hence, the OpenACC runtime goes ahead and uses its internal buffers.

Instead of calling “cudaHostRegister”, can you try adding the CUDA Fortran “pinned” attribute to the declarations of “a”, “b”, and “c”? This would allow the PGI run time to handle the memory pinning.
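
For example, the declarations above might look something like this (a minimal sketch; “got_pinned” is only an illustrative name, and the optional “pinned=” specifier on the allocate statement reports whether page-locking actually succeeded):

       use cudafor
       real (8), dimension(:,:), allocatable, pinned :: a, b, c
       logical :: got_pinned
       integer n,m

       n = 2000
       m = 2000

! arrays with the "pinned" attribute are allocated directly in
! page-locked host memory, so the OpenACC runtime can skip its buffers
       allocate( a(n,m), pinned=got_pinned )
       allocate( b(n,m) )
       allocate( c(n,m) )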

  • Mat

Hi Mat,

Thanks for your answers.

Yes, it works with the “pinned” attribute on the array declarations. The initial idea was not to rewrite a lot of code…

Why not check the memory attributes right before the H2D/D2H operation at run time?

Alexey

Mat,

one more question.
Did I get it right that “acc update … async” makes the compiler insert instructions to copy data from my array to an internal buffer (pinned memory) and then call cuMemcpyAsync? So only the final part is performed asynchronously, while the copy from my array to the internal buffer is performed in the same CPU thread.

In my code I have

!$acc update host(....) async(123)
 call foo() ! I have several kernels here
!$acc wait(123)

NVVP shows that there is mostly no overlap of the GPU-to-CPU transfer with computation; only the last chunk of data (one of dozens) is transferred at the same time the first kernel from foo() is running.

Hi Alexey,

Why not check the memory attributes right before the H2D/D2H operation at run time?

Doing so would add run-time overhead to every copy. Given that this is a very limited use case, and one with recommended alternative solutions, I don’t think we could justify the added overhead.

Yes, it works with the “pinned” attribute on the array declarations. The initial idea was not to rewrite a lot of code…

Good. Using the “pin” flag or the “pinned” attribute should help reduce the amount of code that needs to be rewritten.
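
For example, the “pin” sub-option from your first command line could also be applied to the second test (assuming the same file name):

pgfortran -acc -ta=nvidia,cc35,pin -Mcuda -Minfo=accel test_ord_cuda.F -o test_ord_cuda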

Did I get it right that “acc update … async” makes the compiler insert instructions to copy data from my array to an internal buffer (pinned memory) and then call cuMemcpyAsync?

“async” in this context means that the data copy is performed asynchronously with respect to the host. When copying to the device, the host continues while a helper CPU thread manages the data movement and does use cuMemcpyAsync.
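
As a minimal sketch of that device-bound case (program and variable names here are only illustrative):

       program async_h2d
       real (8), dimension(:,:), allocatable :: a
       integer n,m
       n = 2000
       m = 2000
       allocate( a(n,m) )
!$acc data create(a)
! the host thread continues past this update; a helper CPU thread
! manages the data movement and issues cuMemcpyAsync
!$acc update device(a) async(1)
! ... other host work could overlap the H2D transfer here ...
!$acc wait(1)
!$acc end data
       end program async_h2d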

Unfortunately, due to technical reasons, copies to the host are not performed asynchronously. We have tried multiple implementations, but none worked to our satisfaction. The only reliable method we found was to have the host wait until the copy completed. This is most likely what you’re seeing in NVVP.

  • Mat