I have a new Telsa C2070. I asked suuport if PGI fortran would support Virtual unified address space with device in TCC mode. They sent me a sample code which they said did not work. I modiied the code slightly.They had used cudaHostAlloc. I used cudaMallocHost that does not take flags. This gives a Cptr which can be converted to a Fortran pointer for use on the host. You can do what you want with this memory on the host.
You then use the pinned memory function cudaHostGetDevicePointer to convert the host pointer to a device pointer. This is a Fortran pointer that has to be converted to a Cdevptr now. The result can be passed to a kernel and used there. It essentially works like pinned memory but without the attributes pinned. As I have described it it is not exactly as it woould be in C for cuda. C for cuda does not distinguish device and host pointers. But PgiFortan does. Thus the need for the conversion. This only works with Tesla 2000 series boards which have been set into TCC mode. Also it only works with Windows 7. These are nVidia restrictions. But in principle this method gives the device access to all of the hosts memory and its own memory in a single memory space. if i switch device to a GTX 480 it does not work This convinces me it is not simply pinned memory in disguise.
The host array is declared as a pointer, real. dimension,pointer::, the device array is defined as allocatable, real, dimension, device, allocatable:: . This was what support sent me in their example.
Here is the program that we used to demonstrate unified-virtual-address space.
module m
implicit none
contains
! Square the entries of 'a'
attributes(global) subroutine fillit( a )
real, dimension(*) :: a
integer i
i = (blockidx%x-1)*blockdim%x + threadidx%x
a(i) = a(i) * a(i)
end subroutine
end module
program main
use, intrinsic :: iso_c_binding
use cudafor
use m
implicit none
real, dimension(:), pointer :: ha ! host pointer
type(c_ptr) :: locha ! C pointer which we convert to ha
real, dimension(:), device, allocatable :: da ! device pointer
type(c_devptr) :: locda ! C pointer which we convert to da
integer :: i, n, err, istat
real :: e
! Only works on a Fermi
istat = cudasetdeviceflags(cudaDeviceMapHost)
n = 2048
! Allocate mapped, pinned host memory
istat = cudahostalloc( locha, sizeof(1.0)*n, cudaHostAllocMapped )
call c_f_pointer( locha, ha, (/ n /) )
do i = 1, n
ha(i) = i+2
enddo
! convert the host pointer to a device address
istat = cudahostgetdevicepointer( locda, locha, 0 )
call c_f_pointer( locda, da, (/ n /) )
call fillit<<<16>>>( da )
! Wait for the kernel to finish before testing the results
istat = cudaThreadSynchronize();
! At this point, da(i) (and ha(i)) should be (i+2)**2
err = 0
do i = 1, n
e = float(i+2)
e = e * e
if( ha(i) - e .ne. 0 )then
err = err + 1
if( err <= 10 )then
write(*,10) i, ha(i), e
10 format( 'a(',i4,') = ', e14.8, ', expecting ', e14.8 )
endif
endif
enddo
print *, err, ' errors found'
print *, 'error code = ', cudaGetErrorString( cudaGetLastError() )
end program