Using the PACK, RESHAPE, COUNT intrinsics from the cuTENSOR Fortran library

I have this subroutine:

   SUBROUTINE test2(x, y)
      INTEGER :: x(:,:), y(:,:)
      INTEGER :: i, npti
      INTEGER, ALLOCATABLE :: nptidx(:)
      LOGICAL, ALLOCATABLE :: mask(:)
      INTEGER, ALLOCATABLE :: z(:)
      INTEGER :: k, m, count_rate, count_start, count_end
      call system_clock(count_rate = count_rate)
      call system_clock(count_start)
      do k = 1, ncycles
         mask    = reshape(x > threshold, [jpi*jpj])
         npti    = count(mask)
         nptidx  = pack([(i, i=1,jpi*jpj)], mask)
         if (.not. allocated(z)) allocate(z(npti))
         DO m = 1, 10
            z = pack(reshape(y, [jpi*jpj]), mask)
         END DO
      end do
      call system_clock(count_end)
      write (*,*) "npti: ", npti
      write (*,*) "SUM nptidx: ", SUM(nptidx)
      write (*,*) "SUM z: ", SUM(z)
      write (*,*) "time: ", INT(real(count_end - count_start) / real(count_rate) / ncycles * 1e6)
      deallocate(z)
   END SUBROUTINE

Inspired by section 3.9.4 of the current NVIDIA CUDA Fortran Programming Guide, I’d like a version of it that runs on GPUs using the CUDA versions of RESHAPE, COUNT, and PACK from the CUTENSOREX library. I’m trying to do this with the following version of the code:

  ! the GPU version of test2 (?) -- expected to call pack() and reshape() from the cutensorex
   SUBROUTINE test3(x, y)
      USE cutensorex, only: pack, reshape, count
      INTEGER :: x(:,:), y(:,:)
      INTEGER :: i, npti
      INTEGER, ALLOCATABLE :: nptidx(:)
      LOGICAL, ALLOCATABLE :: mask(:)
      INTEGER, ALLOCATABLE :: z(:)
      INTEGER, ALLOCATABLE :: y1d(:)
      INTEGER :: k, m, count_rate, count_start, count_end
      !$acc data copyin(x,y)
      call system_clock(count_rate = count_rate)
      call system_clock(count_start)
      do k = 1, ncycles
         !$acc host_data use_device(x)
         mask    = reshape(x > threshold, [jpi*jpj])
         !$acc end host_data  
         !$acc host_data use_device(mask)
         npti    = count(mask)
         !$acc end host_data
         !$acc host_data use_device(mask)
         nptidx  = pack([(i, i=1,jpi*jpj)], mask)
         !$acc end host_data
         if (.not. allocated(z)) allocate(z(npti))
         DO m = 1, 10
            !$acc host_data use_device(y)  
            y1d = reshape(y, [jpi*jpj])
            !$acc end host_data
            !$acc host_data use_device(y1d,mask)
            z = pack(y1d, mask)
            !$acc end host_data
         END DO
      end do
      call cudaDeviceSynchronize()
      call system_clock(count_end)
      !$acc end data
      write (*,*) "npti: ", npti
      write (*,*) "SUM nptidx: ", SUM(nptidx)
      write (*,*) "SUM z: ", SUM(z)
      write (*,*) "time: ", INT(real(count_end - count_start) / real(count_rate) / ncycles * 1e6)
      deallocate(z)
   END SUBROUTINE

But I get compiler diagnostics that look like this:

$ nvfortran -cuda -acc=gpu -O3 -o sc_test sc_test.f90 -cudalib=cutensor
NVFORTRAN-S-0074-Illegal number or type of arguments to reshape - keyword argument source (sc_test.f90: 129)
NVFORTRAN-S-0099-Illegal use of derived type (sc_test.f90: 143)
  0 inform,   0 warnings,   2 severes, 0 fatal for test3

where line 129 is `mask = reshape(x > threshold, [jpi*jpj])` and line 143 is `z = pack(y1d, mask)`.

I suspect I’m hitting some limitations here, but I can’t figure out what they are. Is there any way to make this code work?

Best regards,
Alexey

If you could post the full code, it would be easier to check what is wrong.
From a quick look, you may also need to make the LHS a device array: try adding create(mask) and copy(z) to the data clauses before lines 129 and 143.

Hi mfatica,

Here you are: low_latency_cuda/toys/sc_icedyn/sc_test.f90 at master · a-v-medvedev/low_latency_cuda · GitHub

I’ll try to add such directives, but this is a bit demotivating because I don’t really understand the principles that should lead to working code. Such “grope in the dark” activity is a bit irritating. May I suggest extending the documentation with (at least) working examples more complex than just a single function call?

–Alexey

SUBROUTINE test3(x, y)
      USE cutensorex, only: pack, packloc
      INTEGER :: x(:,:), y(:,:)
      TARGET :: x
      INTEGER :: i, npti
      INTEGER, ALLOCATABLE :: nptidx(:)
      LOGICAL, ALLOCATABLE :: mask(:)
      INTEGER, ALLOCATABLE :: z(:)
      INTEGER, POINTER :: x1d(:)
      INTEGER :: k, m, count_rate, count_start, count_end

      allocate(nptidx(jpi*jpj))  ! largest possible size
      allocate(z(jpi*jpj))       ! largest possible size

      !$acc data copyin(x,y) copyout(nptidx,z)
      x1d(1:jpi*jpj) => x(:,:)
      call system_clock(count_rate = count_rate)
      call system_clock(count_start)
      do k = 1, ncycles
         !$acc host_data use_device(x1d, nptidx)
         nptidx  = packloc(x1d > threshold, count=npti)
         !$acc end host_data
         DO m = 1, 10
            !$acc host_data use_device(x, y, z)
            z = pack(y, x > threshold)
            !$acc end host_data
         END DO
      end do
      call cudaDeviceSynchronize()
      call system_clock(count_end)
      !$acc end data
      write (*,*) "npti: ", npti
      write (*,*) "SUM nptidx: ", SUM(nptidx(1:npti))
      write (*,*) "SUM z: ", SUM(z(1:npti))
      write (*,*) "time: ", INT(real(count_end - count_start) / real(count_rate) / ncycles * 1e6)
      deallocate(z)
   END SUBROUTINE

The GPU version expects the result array to be allocated already (as noted in https://docs.nvidia.com/hpc-sdk/archive/23.7/compilers/cuda-fortran-prog-guide/index.html ).
Also, the timing you have will not be very accurate, because the first call to these accelerated functions needs to load the library. If you call the routine twice and time the second call, you will get a more relevant timing.
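The warm-up advice above can be sketched as follows; `do_work()` is a placeholder standing in for the accelerated pack/packloc calls, not a routine from the thread:

```fortran
! Sketch: call the accelerated routine once untimed to absorb
! library-load overhead, then time only the steady-state calls.
call do_work()                        ! warm-up: loads cuTENSOR, builds plans
call system_clock(count_start)
do k = 1, ncycles
   call do_work()                     ! steady-state calls are what we time
end do
call system_clock(count_end)
```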

Thanks for this version. Will any explanations follow? Why did reshape disappear in your version, with a pointer appearing instead? Why did the mask array disappear (it is a good thing to have), with the condition duplicated instead (a bad thing to have)?

–Alexey

You don’t need reshape (which creates a complete new duplicate array) to treat a contiguous 2D array as a 1D array. Just recast it, using a pointer, a Cray pointer, or a subroutine boundary.
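For reference, a minimal sketch of the pointer-based recast; the names `a2d`, `a1d`, `ni`, `nj` are illustrative, not from the thread:

```fortran
! Sketch: view a contiguous 2D array as 1D without copying.
integer, parameter :: ni = 4, nj = 3
integer, target  :: a2d(ni, nj)
integer, pointer :: a1d(:)

a2d = 1
a1d(1:ni*nj) => a2d   ! rank-remapping pointer assignment (Fortran 2008);
                      ! the target must be simply contiguous
print *, sum(a1d)     ! same storage as a2d, no duplicate array
```

This is the same technique as the `x1d(1:jpi*jpj) => x(:,:)` line in the working version above.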

You can create a mask if you want. But then again, that is another device array to store: the mask takes as much storage (LOGICAL(4)) as the source (INTEGER(4)).

Thanks, those are the performance considerations, and they totally make sense.

My original question was not about performance, but about whether the way we use PACK and RESHAPE is correct. I understand now that I have to write `!$acc host_data use_device` for all arguments and LHS arrays (which is not obvious from the compiler messages and, probably, not clearly stated in the docs). I also noticed that for SUM_PREFIX, `!$acc host_data use_device` is not necessary: if the data is on the GPU, it works for me. If there really is a difference between SUM_PREFIX and, say, RESHAPE, that makes things inconsistent, but maybe I’m mistaken.

Besides that, in my original code I wasn’t able to make RESHAPE work the way it is written there. Is `mask = reshape(x > threshold, [jpi*jpj])` impossible to write for some reason? In any case, it would be good to have a general guideline for PACK, RESHAPE, and the others; the current compiler error messages seem quite cryptic to me.

–Alexey

sum_prefix should have the same requirements as reshape with respect to host_data use_device. If you have an example where it doesn’t, send it along.

Reshape is implemented via the cuTENSOR Permute operation, and that only takes real and complex data. I’ve asked about using it on integer (and maybe logical?) data, but the scaling by alpha, as I understand it, is difficult to remove, or would require doubling the amount of code.

Section 3.9.4.7 of the CUDA Fortran guide states that the arrays passed to reshape can be of type real(2), real(4), real(8), complex(4), or complex(8). As I said, that is what cuTENSOR supports. You are trying to pass a logical array.
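Given that restriction, one way to sidestep it, sketched under the same assumptions as the working version above (a contiguous source array with the TARGET attribute), is to never reshape the logical data at all and instead build the 1D mask from a 1D pointer view of x:

```fortran
! Sketch: avoid reshaping a LOGICAL array (unsupported by cuTENSOR
! Permute) by computing the 1D mask from a rank-remapped view of x.
integer, pointer     :: x1d(:)
logical, allocatable :: mask(:)

x1d(1:jpi*jpj) => x          ! no-copy 1D view; x must be contiguous
mask = x1d > threshold       ! elementwise; no reshape of logicals needed
```

The mask is then already the flat shape that count and pack expect.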