Setting a pointer inside a cuda fortran kernel

Hello,

I am trying to use a pointer in a cuda fortran kernel, however the compiler won’t seem to let me. I made a simplified version of the kernel that reproduces the same errors.
The program

module cuda_kernels
    use cudafor
    implicit none
  contains
    attributes(global) subroutine process_slice(matrix, nrows, ncols, target_row)
      ! Kernel to process a particular row of the matrix
      real, device, intent(inout), target :: matrix(nrows,ncols)
      integer, value :: nrows, ncols, target_row
      integer :: idx_thread
      real, device, pointer :: row_ptr(:)
  
      idx_thread = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  
      row_ptr => matrix(target_row, :)
      !  row_ptr(1:ncols) => matrix(target_row, :) ! this also fails
  
      if (idx_thread < ncols) then
        row_ptr(idx_thread) = row_ptr(idx_thread) + 1.0
      end if
    end subroutine process_slice

  end module cuda_kernels
  
  program test_cuda_pointers
    use cuda_kernels
    
    real, allocatable :: matrix_h(:,:)
    real, device, allocatable :: matrix_d(:, :)
    integer :: i, j

    integer, parameter :: nrows = 5
    integer, parameter :: ncols = 5

    allocate(matrix_h(nrows, ncols))
    allocate(matrix_d(nrows, ncols))
    
    ! Initialize the host matrix
    do i = 1, nrows
        do j = 1, ncols
        matrix_h(i, j) = real((i - 1) * ncols + j)
        end do
    end do

    ! copy the matrix from host to device
    matrix_d = matrix_h

    print *, "Matrix before processing:"
    print *, matrix_h

    ! Launch the kernel to increment each element of the second row by 1
    call process_slice<<<1, ncols>>>(matrix_d, nrows, ncols, 2)

    ! Copy the data back to the host
    matrix_h = matrix_d

    print *, "Matrix after processing the second row:"
    print *, matrix_h

    ! Deallocate the device matrix
    deallocate(matrix_d)
    deallocate(matrix_h)
  end program test_cuda_pointers

produces the following errors based on if row_ptr => matrix(target_row, :) or row_ptr(1:ncols) => matrix(target_row, :) is used.

The errors are

> nvfortran -O0 -cuda  -gpu=cc61,debug, -g   -o test_cuda_pointers test_cuda_pointers.f90
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure (test_cuda_pointers.f90: 5)
NVFORTRAN/x86-64 Linux 24.7-0: compilation aborted

or

> nvfortran -O0 -cuda  -gpu=cc61,debug, -g   -o test_cuda_pointers test_cuda_pointers.f90
NVFORTRAN-S-0155-Call to compiler runtime library function not supported. pghpf_ptr_assn_i8 (test_cuda_pointers.f90: 5)
Unimplemented opcode: 0
NVFORTRAN-F-0000-Internal compiler error. Unimplemented opcode.       4  (test_cuda_pointers.f90: 5)
NVFORTRAN/x86-64 Linux 24.7-0: compilation aborted

I have looked in the documentation but have not found a place where it says that I can not do this with pointers, but perhaps I just missed it. I tried adding -MINFO=all, but that didn’t add anymore information to the output.

I will try this with the newest set of compilers when I have a chance.

Thank you for your help,
Josh

As you found, we don’t support general F90 pointer assignment in CUDA Fortran device code. Mostly, it is a bad idea to use F90 pointers this way in device code. The reason being, your kernel example above would spend > 90% of its time setting up the pointer descriptor infrastructure, and just a few cycles for each thread to update one element of the array. The pointer descriptor uses several registers in each thread’s register/local memory. On CPUs, the cost of the F90 pointer is amortized over an entire array’s worth of work. But on a GPU, each thread does on-the-order of 1 element of work, so it is just not worth supporting this feature, currently, unless we really could find a performant way to do so.

Thank you for the reply. That makes sense. In my actual kernel I am indeed register limited so based on your comment I understand why I wouldn’t want to use a pointer any way.

Do you have any recommendations for splitting up a larger scratch array of shared memory into smaller chunks within a kernel without pointers? I’d like to use the scratch array for several different purposes throughout the kernel but also want it to have a meaningful name at different sections of the code.

What f90 pointers give you (what you pay for) is the ability to have non-contiguous sections, which is what you had above. There is no good way to do that, other than manually keeping track of the stride yourself.

If you want to just alias sequential areas, especially in shared memory, there are two good ways. First, you can declare the shared arrays using (*) syntax, assumed size, and those overlap, as discussed in the CUDA Fortran programming guide. Or, you can use cray pointers, which are very light-weight, and the compiler basically does all the aliasing for you, you just have to initialize them correctly (which is a little wonky, but once done, works great). Here is an example using both:

module mtests
contains
attributes(global) subroutine testit(x)
real, device :: x()
integer, shared :: k(
) ! r and k overlap with dynamic space passed in
real, shared :: r(*)
integer :: i1, i4
real, device :: s(16); pointer(ps, s) ! Cray pointer
integer, device :: m(16); pointer(pm, m) ! Cray pointer
iam = threadIdx%x
if (iam .le. 16) then
i1 = z’3f800000’
k(iam) = i1
else if (iam .le. 32) then
r(iam) = 2.0
else if (iam .le. 48) then
ps = loc(r(33))
idx = iam - 32
s(idx) = 3.0
else
pm = loc(k(49))
i4 = z’40800000’
idx = iam - 48
m(idx) = i4
end if
call syncthreads()
x(iam) = r(iam)
return
end subroutine
end module mtests

program t
use mtests
use cudafor
real, managed :: x(64)
call testit<<<1,64,256>>> (x)
istat = cudaDeviceSynchronize
do i = 1, 64
print *,i,x(i)
end do
end

Thank you for your thorough rely!

The documentation may need an update as it seems to clearly indicate that cray pointers are not allowed here (perhaps it is supposed to say f90 style pointers). But later says they are recommend for accessing memory across a thread block cluster here.

Thanks again for your help.