PGffortran and CUDA 4.0 with Tesla C2070 in TCC mode and UVD

I have a new Telsa C2070. I asked suuport if PGI fortran would support Virtual unified address space with device in TCC mode. They sent me a sample code which they said did not work. I modiied the code slightly.They had used cudaHostAlloc. I used cudaMallocHost that does not take flags. This gives a Cptr which can be converted to a Fortran pointer for use on the host. You can do what you want with this memory on the host.
You then use the pinned memory function cudaHostGetDevicePointer to convert the host pointer to a device pointer. This is a Fortran pointer that has to be converted to a Cdevptr now. The result can be passed to a kernel and used there. It essentially works like pinned memory but without the attributes pinned. As I have described it it is not exactly as it woould be in C for cuda. C for cuda does not distinguish device and host pointers. But PgiFortan does. Thus the need for the conversion. This only works with Tesla 2000 series boards which have been set into TCC mode. Also it only works with Windows 7. These are nVidia restrictions. But in principle this method gives the device access to all of the hosts memory and its own memory in a single memory space. if i switch device to a GTX 480 it does not work This convinces me it is not simply pinned memory in disguise.
The host array is declared as a pointer, real. dimension,pointer::, the device array is defined as allocatable, real, dimension, device, allocatable:: . This was what support sent me in their example.

Here is the program that we used to demonstrate unified-virtual-address space.

module m
 implicit none
 ! Square the entries of 'a'
 attributes(global) subroutine fillit( a )
  real, dimension(*) :: a
  integer i
  i = (blockidx%x-1)*blockdim%x + threadidx%x
  a(i) = a(i) * a(i)
 end subroutine 
end module

program main
 use, intrinsic :: iso_c_binding
 use cudafor
 use m
 implicit none
 real, dimension(:), pointer :: ha	! host pointer
 type(c_ptr) :: locha	! C pointer which we convert to ha
 real, dimension(:), device, allocatable :: da	! device pointer
 type(c_devptr) :: locda	! C pointer which we convert to da
 integer :: i, n, err, istat
 real :: e

 ! Only works on a Fermi
 istat = cudasetdeviceflags(cudaDeviceMapHost)

 n = 2048

 ! Allocate mapped, pinned host memory
 istat = cudahostalloc( locha, sizeof(1.0)*n, cudaHostAllocMapped )
 call c_f_pointer( locha, ha, (/ n /) )

 do i = 1, n
  ha(i) = i+2
 ! convert the host pointer to a device address
 istat = cudahostgetdevicepointer( locda, locha, 0 )
 call c_f_pointer( locda, da, (/ n /) )

 call fillit<<<16>>>( da )

 ! Wait for the kernel to finish before testing the results
 istat = cudaThreadSynchronize();

 ! At this point, da(i) (and ha(i)) should be (i+2)**2
 err = 0
 do i = 1, n
  e = float(i+2)
  e = e * e
  if( ha(i) - e .ne. 0 )then
   err = err + 1
   if( err <= 10 )then
    write(*,10) i, ha(i), e
10  format( 'a(',i4,') = ', e14.8, ', expecting ', e14.8 )
 print *, err, ' errors found'
 print *, 'error code = ', cudaGetErrorString( cudaGetLastError() )
end program