Unified memory

Dear All,

Could you let me know whether the “managed” attribute should be declared on dummy arguments?
I’d like to use “unified memory”, but I don’t know if my source code is correct. Could you check it for me?

Here is my source code.

module gpu_kernel
use cudafor
implicit none

contains
attributes(global) subroutine stream_triad(n,a,b,c)
integer(kind=4),value :: n
real(kind=8),dimension(n),intent(in),managed :: a,b
real(kind=8),dimension(n),intent(inout),managed :: c

integer(kind=4) :: i,idx
real(kind=8) :: scalar

scalar = 0.5d0 * a(1)

idx = threadIdx%x + (blockIdx%x - 1) * blockDim%x

do i = idx,n,blockDim%x*gridDim%x
if( i <= n ) then
c(i) = a(i) * scalar*b(i)
end if
end do
end subroutine stream_triad
end module gpu_kernel

program unified_memory_test
use gpu_kernel
implicit none

integer(kind=4),parameter :: n=55000000,offset=0,ndim=n+offset,ntimes=10,nbpw=8
integer(kind=4),parameter :: nthreads = 128
integer(kind=4) :: i,j
integer(kind=4) :: nblocks
real(kind=8),dimension(:),allocatable,managed :: a,b,c

nblocks = (n + nthreads - 1) / nthreads

allocate(a(n))
allocate(b(n))
allocate(c(n))

do i = 1, n
a(i) = 2.0d0
b(i) = 0.5d0
c(i) = 0.0d0
end do

call stream_triad<<<nblocks,nthreads>>>(n,a,b,c)
cudaDeviceSynchronize()

deallocate(a)
deallocate(b)
deallocate(c)

end program unified_memory_test

Hi KOUCHI_Hiroyuki,

Could you let me know whether the “managed” attribute should be declared on dummy arguments?

It’s extraneous on the kernel’s dummy arguments but shouldn’t hurt. You only need to add “managed” to the declaration of the actual variables, which you already have in the main program.

The only error I see here is with the call to “cudaDeviceSynchronize()”. Since this is a function, you need to capture the return value. Something like the following where “rc” is an integer.

rc = cudaDeviceSynchronize()
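
Once the return value is captured, it can also be checked. The following is only a minimal sketch, not part of the original code: the module `sync_check`, the helper `check_cuda`, and the small `fill` kernel are hypothetical names introduced for illustration. It assumes the standard `cudafor` routines `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString`, and the constant `cudaSuccess`.

```fortran
module sync_check
  use cudafor
  implicit none
contains
  ! Hypothetical helper (not from the thread): stop with a message if a
  ! CUDA runtime call returned anything other than cudaSuccess.
  subroutine check_cuda(rc, label)
    integer, intent(in) :: rc
    character(len=*), intent(in) :: label
    if (rc /= cudaSuccess) then
      print *, 'CUDA error in ', label, ': ', trim(cudaGetErrorString(rc))
      stop 1
    end if
  end subroutine check_cuda

  ! Small illustrative kernel: each thread sets one element to 1.0.
  attributes(global) subroutine fill(n, x)
    integer, value :: n
    real(kind=8), dimension(n) :: x
    integer :: i
    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (i <= n) x(i) = 1.0d0
  end subroutine fill
end module sync_check

program sync_check_demo
  use sync_check
  implicit none
  integer, parameter :: n = 1024, nthreads = 128
  real(kind=8), dimension(:), allocatable, managed :: x

  allocate(x(n))
  call fill<<<(n + nthreads - 1) / nthreads, nthreads>>>(n, x)
  ! Errors in the launch itself are reported by cudaGetLastError.
  call check_cuda(cudaGetLastError(), 'kernel launch')
  ! Errors during kernel execution are reported at synchronization, which
  ! is also needed before the host reads the managed array.
  call check_cuda(cudaDeviceSynchronize(), 'cudaDeviceSynchronize')
  print *, x(1), x(n)
  deallocate(x)
end program sync_check_demo
```

Checking both calls separates launch-configuration problems from failures inside the kernel, which would otherwise go unnoticed when “rc” is captured but never inspected.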

Hope this helps,
Mat

% cat test.cuf
module gpu_kernel
use cudafor
implicit none

contains
attributes(global) subroutine stream_triad(n,a,b,c)
integer(kind=4),value :: n
real(kind=8),dimension(n),intent(in) :: a,b
real(kind=8),dimension(n),intent(inout) :: c

integer(kind=4) :: i,idx
real(kind=8) :: scalar

scalar = 0.5d0 * a(1)

idx = threadIdx%x + (blockIdx%x - 1) * blockDim%x

do i = idx,n,blockDim%x*gridDim%x
if( i <= n ) then
c(i) = a(i) * scalar*b(i)
end if
end do
end subroutine stream_triad
end module gpu_kernel

program unified_memory_test
use gpu_kernel
implicit none

integer(kind=4),parameter :: n=55000000,offset=0,ndim=n+offset,ntimes=10,nbpw=8
integer(kind=4),parameter :: nthreads = 128
integer(kind=4) :: i,j
integer(kind=4) :: nblocks, rc
real(kind=8),dimension(:),allocatable,managed :: a,b,c

nblocks = (n + nthreads - 1) / nthreads

allocate(a(n))
allocate(b(n))
allocate(c(n))

do i = 1, n
a(i) = 2.0d0
b(i) = 0.5d0
c(i) = 0.0d0
end do

call stream_triad<<<nblocks,nthreads>>>(n,a,b,c)
rc = cudaDeviceSynchronize()
do i = 1, 10
  print *, i, c(i)
enddo
deallocate(a)
deallocate(b)
deallocate(c)

end program unified_memory_test
% pgf90 -Mcuda=cc70 -V18.7 test.cuf; a.out
            1    1.000000000000000
            2    1.000000000000000
            3    1.000000000000000
            4    1.000000000000000
            5    1.000000000000000
            6    1.000000000000000
            7    1.000000000000000
            8    1.000000000000000
            9    1.000000000000000
           10    1.000000000000000

Dear Mat-san,

Thank you for your reply.

Sincerely yours,