Using C-pointer for 3D Fortran array

I am using CUDA Fortran. In my code, I am using linear array in GPU, and then copy it back to CPU using a linear array. This array is then RESHAPE() into a 3D array.

This is slow as I need 2 copies: one from GPU and one from CPU linear to CPU 3D array.

My question is if I can just declare a C-pointer to the 3D array, and then using cuda copy to copy linear array to this C-pointer which is pointing to the first element in the 3D array?

Thanks,
Tuan

Hi Tuan,

My question is if I can just declare a C-pointer to the 3D array, and then using cuda copy to copy linear array to this C-pointer which is pointing to the first element in the 3D array?

Sure. You can use “integer(c_devptr)” for CUDA C device pointers and “integer(c_ptr)” for host C pointers. Then use “c_f_pointer” to create the association between you 3D arrays and the C pointer. You can grab the C pointer from a Fortran host array using “c_loc”. I wrote a basic example below.

Though, I’m wondering if you really need to do all this. If you’re just using CUDA Fortran and your arrays are contiguous, a simple “arr1 = arr2” statement will perform the same thing.

Hope this helps,
Mat

% cat test.cuf

module test_mod
  integer, device, allocatable :: x(:,:,:)
  contains
    attributes(global) subroutine kernel(l,m,n)
    integer, value :: l, m, n
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    j = (blockidx%y-1)*blockdim%y + threadidx%y
    if (i <= l .and. j <= m ) then
       do k=1,N
         x(i,j,k) = ((i-1)*L*M)+((j-1)*M)+k
       enddo
    end if
    return
    end subroutine
end module test_mod

program test
use cudafor
use test_mod
implicit none
integer, parameter :: L = 64
integer, parameter :: M = 64
integer, parameter :: N = 64
integer size
integer istat
integer, target, allocatable, dimension(:,:,:) :: y
type(dim3) :: blocks
type(dim3) :: threads
type(c_ptr) :: h1
type(c_devptr) :: cd1

allocate(y(L,M,N))
size = L*M*N
istat = cudaMalloc(cd1, size*4)
call c_f_pointer(cd1, x, (/ L, M, N /))

blocks = dim3((L+31)/32, (M+31)/32, 1)
threads = dim3(32, 32, 1)

call kernel<<<blocks,threads>>>(L,M,N)
istat = cudaThreadSynchronize()
print *, cudaGetErrorString(cudaGetLastError())

h1 = c_loc(y)
istat = cudaMemcpy(h1, cd1, size*4, cudaMemcpyDeviceToHost)

print *, y(1,1,:)
print *, y(64,64,:)
deallocate(y)
istat = cudaFree(cd1)
end
% pgf90 test.cuf ; a.out
 no error
            1            2            3            4            5            6
            7            8            9           10           11           12
           13           14           15           16           17           18
           19           20           21           22           23           24
           25           26           27           28           29           30
           31           32           33           34           35           36
           37           38           39           40           41           42
           43           44           45           46           47           48
           49           50           51           52           53           54
           55           56           57           58           59           60
           61           62           63           64
       262081       262082       262083       262084       262085       262086
       262087       262088       262089       262090       262091       262092
       262093       262094       262095       262096       262097       262098
       262099       262100       262101       262102       262103       262104
       262105       262106       262107       262108       262109       262110
       262111       262112       262113       262114       262115       262116
       262117       262118       262119       262120       262121       262122
       262123       262124       262125       262126       262127       262128
       262129       262130       262131       262132       262133       262134
       262135       262136       262137       262138       262139       262140
       262141       262142       262143       262144

The reason I’m doing this is that I use a linear representation of a 3D array in kernel. However, when doing some calculation I only need to use a smaller part of the array (i.e. avoiding boundary or stencil gridpoints) so I need to reshape the array into 3D and extract the inner part of the array.

Currently, I need to copy the data back, then RESHAPE the array to use 3D-indexing to extract the data in the smaller subarray easier.

Do you have a better approach for this?

Thanks,
Tuan

My another question. If I use C-pointer in Fortran code pointing to data allocated by Fortran which is column-based, so when I use the C-pointer should the compiler treats it as row-based like in C or column-based like in Fortran?

Hi Tuan,

You can’t access the c_ptr directly in Fortran but instead it’s associated Fortran variable. For example in the above program, you can say “y(1,1,1)” but not “h1(1)”.

Currently, I need to copy the data back, then RESHAPE the array to use 3D-indexing to extract the data in the smaller subarray easier.

Do you have a better approach for this?

Have you tired using F90 pointers?

  • Mat