Question about unified memory in CUDA Fortran

Hello, I would like to ask about using unified memory in CUDA Fortran.
I am computing on a 3D array and want to increase its size from 512^3 to 768^3,
but the GPU does not have enough memory for the 768^3 computation.
(My GPU is a Tesla P100 16GB; CUDA version 8.0, PGI compiler 17.7, OS: Linux CentOS.)

So I declared the arrays as managed. A brief outline of the code is below.

Module cuda_kernel
Use cudafor

!Calculation of next time step phi as phi_out
attributes(global)subroutine calphi(—some variables—)

! declare variables
Real(8),dimension(nx,ny,nz),device:: phi, phi_out

phi_out = phi + dt * ( ~~ )

end subroutine calphi

end module cuda_kernel

program main
use cudafor
use cuda_kernel

! declare variables
Real(8),managed,allocatable,dimension(:,:,:):: phi, phi_out
! <-- I declared the 3-dimensional arrays as managed memory.

state = cudaSetDevice(0)
blocks = dim3(nx/16,nx/16,1)
threads = dim3(16,16,1)


! set initial condition of phi
phi(i,j,k) = ~~~
state = cudaThreadSynchronize()

do timestep = 1,10000   ! iterative time-stepping loop

call calphi<<<blocks,threads>>>(—some variables—)
state = cudaThreadSynchronize()

phi = phi_out   ! update of array
state = cudaThreadSynchronize()

end do

state = cudaFree(phi)
state = cudaFree(phi_out)

end program main

With nx=ny=nz=512, the calculation ran successfully.

But with nx=ny=nz=768, phi_out was not computed over the entire domain.
(I checked, after synchronizing, that the initial condition of phi was set over the entire domain.)

The next problem occurs when I update the array.
With nx=ny=nz=512 there was no trouble, but with 768
the following memory error occurred.

0: copyover Memcpy (dst=0x0x700200000, src=0x0x7d8200000, size=3623878656) FAILED: 77(an illegal memory access was encountered)

How can I solve these problems?
I would really appreciate any help!

I don’t think we have virtualized the GPU memory to the point where you can
assume it is of any size and have the compiler handle the movement of
data to and from the physical memories and coordinate the computation.

I do hope we someday can get to this point, but it has not happened yet.


Thank you Dave for answering my question.

You mean it is impossible to compute on arrays larger than the GPU memory by using unified memory, right?
Then do I have to use a CUDA + MPI strategy?

Thank you again!

I think the problem here is that you need to add “-Mlarge_arrays” or “-mcmodel=medium” since the arrays are over 2GB.
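To see why only the 768 case crosses the 2GB threshold, the per-array size can be checked with a few lines of standard Fortran. This is a rough sketch: the program and variable names are made up for illustration, and the limit is taken as 2^31 bytes, which is an assumption about what "2GB" means here.

```fortran
program array_size_check
  use iso_fortran_env, only: int64
  implicit none
  ! Assumption: the "2GB" limit is interpreted as 2**31 bytes.
  integer(int64), parameter :: two_gib = 2_int64**31
  integer(int64), parameter :: bytes_per_real8 = 8_int64
  integer, parameter :: sizes(2) = [512, 768]
  integer(int64) :: n, nbytes
  integer :: i

  do i = 1, size(sizes)
    n = int(sizes(i), int64)
    ! One Real(8) array of n x n x n elements, computed in 64-bit integers
    nbytes = n*n*n*bytes_per_real8
    print '(a,i0,a,i0,a,l1)', 'n = ', sizes(i), ': ', nbytes, &
          ' bytes per array; over 2 GiB? ', nbytes > two_gib
  end do
end program array_size_check
```

For n = 768 this gives 3623878656 bytes per array, which is exactly the size reported in the failed Memcpy above, while n = 512 gives 1073741824 bytes and stays under the limit.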

When using CUDA 8.0 or later with a P100 or V100, you should be able to oversubscribe the device memory. Also, since you’re using managed memory, there’s no need to keep both the “phi” and “phi_out” arrays. You can just use “phi” on both the host and device.
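As a minimal sketch of using a single managed array on both host and device, something like the following might work. The module, kernel, and variable names are hypothetical, and the kernel is a simple in-place scaling rather than the original update formula, which is not shown in the question. A single 768^3 array fits in 16GB on its own, so this illustrates the managed-array pattern rather than oversubscription itself.

```fortran
module scale_kernel
  use cudafor
  implicit none
contains
  ! Hypothetical kernel: each thread handles one (i,j) column and loops over k.
  attributes(global) subroutine scale(phi, nx, ny, nz, factor)
    integer, value :: nx, ny, nz
    real(8), value :: factor
    real(8) :: phi(nx, ny, nz)   ! array dummies in a global subroutine live on the device
    integer :: i, j, k
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
    if (i <= nx .and. j <= ny) then
      do k = 1, nz
        phi(i, j, k) = factor*phi(i, j, k)
      end do
    end if
  end subroutine scale
end module scale_kernel

program managed_demo
  use cudafor
  use scale_kernel
  implicit none
  ! The array exceeds 2GB, so -Mlarge_arrays (see above) is assumed at compile time.
  integer, parameter :: nx = 768, ny = 768, nz = 768
  real(8), managed, allocatable :: phi(:,:,:)
  type(dim3) :: blocks, threads
  integer :: istat

  allocate(phi(nx, ny, nz))
  phi = 1.0d0                          ! initialize the managed array
  threads = dim3(16, 16, 1)
  blocks = dim3((nx + 15)/16, (ny + 15)/16, 1)   ! round up to cover the whole domain
  call scale<<<blocks, threads>>>(phi, nx, ny, nz, 2.0d0)
  istat = cudaDeviceSynchronize()      ! wait for the kernel before reading phi on the host
  print *, 'phi(1,1,1) =', phi(1, 1, 1)
  deallocate(phi)                      ! managed allocatables are released with deallocate
end program managed_demo
```

Note that an update which reads neighboring points would still need a second array (or some other scheme) to avoid overwriting values that other threads have yet to read; the in-place form here only works because each element depends on itself alone.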