Mapping the kernel arguments for PGI-generated OpenACC CUDA code

I have an OpenACC Fortran code, and I am generating the CUDA kernel code with the PGI compiler using the flags “nollvm,keep”.
Using that kernel code, I call the kernel from a Fortran file via iso_c_binding,
but the results do not match the OpenACC version.
Please suggest what I am missing.

I am uploading the code for all of the files:

openacc.f90.txt: the program, using an OpenACC directive, that negates all of the values in an array.

Updated_fortran.f90.txt: the updated code with iso_c_binding, which calls the kernel generated by PGI.

kernel.cu.txt: the kernel code generated by PGI, plus the memory allocations and the kernel launch.

How should the Fortran loop iterators be mapped (and in which order), and what are the long long variables that get generated?

The first thing to note is that in Fortran, a “REAL” defaults to kind=4, which corresponds to a float in C. I adjusted your “updated_fortran.f90” file to use “REAL(8)”, which is a double in C.
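To make the size mismatch concrete, here are hypothetical C prototypes (not taken from your files) showing the correspondence; if the Fortran array stays default REAL while the C side treats the buffer as double*, the element sizes disagree (4 vs. 8 bytes) and the data gets reinterpreted with the wrong element size:

/* Hypothetical prototypes, only to illustrate the kind mapping */
extern "C" void launch_main_gpu_r4(int e1, float  *darray, int b1);  /* Fortran REAL    <-> C float  */
extern "C" void launch_main_gpu_r8(int e1, double *darray, int b1);  /* Fortran REAL(8) <-> C double */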

I wasn’t sure where you got the kernel code from, but I’m not sure it’s correct. I spent some time looking at how to get it to work, but instead decided to grab the kernel generated by recompiling “openacc.f90” without a fixed size for “e1”. Otherwise, the compiler will bake that fixed size into the generated kernel, which isn’t what you want.
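Purely to illustrate what I mean by the size getting baked in (this is a sketch, not actual compiler output): with a fixed-size array the trip count can end up hard-coded in the kernel, whereas you want it passed in as an argument.

/* Illustrative only: a hard-coded trip count vs. one passed as an argument */
__global__ void kernel_fixed_size(double* p3, long long x2)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < 10)                     // trip count baked in from the fixed-size declaration
      p3[i] = -(double)(i + x2);   // darray(b1+i) = -(b1+i)
}

__global__ void kernel_variable_size(int n, double* p3, long long x2)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)                      // trip count passed in, works for any bounds
      p3[i] = -(double)(i + x2);
}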

Note that the generated kernel really isn’t a good starting point for creating your own kernel. It’s very context specific and typically only used by our compiler engineers for debugging purposes; our normal device code generation is done through LLVM. Also, “nollvm” is no longer a supported option, so we can’t guarantee the generated code is correct in all cases.

So while I was able to get this example to work, I also wrote a simple kernel that mimics what the compiler generates, without the need to adjust between Fortran and C indexing. I’m not clear on your interest in the generated CUDA C, but it’s definitely not the right way to go if you want to see how to translate a Fortran loop into a CUDA C kernel.
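As for the argument mapping: as far as I can tell from the generated code, with the way I’m launching it (e1+1, device pointer, b1), “tc2” acts as the exclusive upper bound of the Fortran loop index, “p3” is the device copy of the array, and “x2” (the “z_b_3” variable) is the lower bound b1, kept as a long long because it feeds the 64-bit address calculation; it gets subtracted to turn the Fortran index into a zero-based C offset. Here’s a readable sketch of that reading (hypothetical, not compiler output):

__global__ __launch_bounds__(128)
void readable_kernel(int tc2, double* p3, long long x2)
{
   // Grid-stride loop over the Fortran loop index i = b1 .. e1
   long long stride = (long long)gridDim.x * blockDim.x;
   for (long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
        i < tc2;                   // tc2 = e1 + 1, exclusive upper bound of i
        i += stride) {
      if (i < x2) continue;        // only touch indices within darray(b1:e1)
      p3[i - x2] = -(double)i;     // darray(i) = -i, shifted to 0-based C indexing
   }
}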

Working example:

% cat updated_fortran.f90
program dynamic_array
     use iso_c_binding
  implicit none

  interface
    subroutine launch_main_gpu(e1,darray,b1)  bind(c,name="launch_main_gpu")
    use iso_c_binding
      integer, value       :: e1
      type(c_ptr),value    :: darray
      integer, value       :: b1

      end subroutine
   end  interface

        ! rank is 1, size not known at compile time
   real(8), dimension (:), allocatable,target :: darray
   integer(c_int) :: b1, b2, e1, e2
   integer :: i, j
   print*, "Enter the start of the array:"
   read*, b1
   print*, "Enter the end of the array:"
   read*, e1

   if (b1.ge.e1) then
        print *, "Error: end must be greater than start ", e1, "<=", b1
        stop
   endif
   ! allocate memory
   allocate ( darray(b1:e1) )
   do i = b1, e1
         darray(i) = i
         print*, "darray(",i,") = ", darray(i)
   end do


 call launch_main_gpu(e1,c_loc(darray),b1)
! !$acc kernels copyin(darray(b1:e1)), copyout(darray(b1:e1))
! !$acc  loop
!   do i = b1,e1
!         darray(i) = -i
!   end do
!   !$acc end kernels

   do i = b1, e1
         print*, "RET(",i,") = ", darray(i)
   end do
   deallocate (darray)
end program dynamic_array


% cat kernel.cu
#include <stdio.h>

__global__ __launch_bounds__(128)
void kernel(
    int tc2,
    double* p3,
    long long x2)
{
int _i_1, _i_2;
int e34;
int j15;
int j19;
int j20;
int j27;
e34 = ((int)gridDim.x)*(128); /* lilix:128 */
_i_1 = ((int)blockIdx.x)*(128); /* lilix:8 */
_i_2 = (_i_1)+((int)threadIdx.x); /* lilix:181 */
j15 = _i_2; /* lilix:182 */
j19 = ((((int)threadIdx.x)-(tc2))+(_i_1))+(1); /* lilix:190 */
j20 = -(_i_2); /* lilix:198 */
j27 = (tc2)+((e34)-(e34)); /* lilix:206 */
_BB_9: ;
if( ((j19)>0)) goto _BB_10; /* lilix:57 */
(( double*)p3/* .Z0631 */)[((long long)(j15))-(x2/* z_b_3 */)] = ((double)(j20)); /* lilix:71 */
_BB_10: ;
j15 = (j15)+(e34); /* lilix:187 */
j19 = (j19)+(e34); /* lilix:195 */
_i_1 = -(e34); /* lilix:200 */
j20 = (j20)+(_i_1); /* lilix:203 */
j27 = (j27)+(_i_1); /* lilix:211 */
if( ((j27)>0)) goto _BB_9; /* lilix:92 */
}


__global__ __launch_bounds__(128)
void simple_kernel(
    int n,
    double* p
)
{
   int nblks = gridDim.x;
   int blkdim = blockDim.x; // 128
   int blk = blockIdx.x;
   int tid = threadIdx.x;
   int i = (blk*blkdim) + tid;
   while (i < n) {
        p[i]=-p[i];
        i=i+(nblks*blkdim);
   }
}
extern "C" void launch_main_gpu(int e1, double *darray,int b1)
{
        double* d1;
        long long n;
        n=(e1-b1)+1;
        printf("Enter GPU main: %d %d\n",e1,b1);
        cudaMalloc(&d1,n * sizeof(double));
        printf("///////////////////////////Memory allocated//////////////////////////////\n");
        cudaMemcpy(d1, darray, n * sizeof(double),cudaMemcpyHostToDevice);
        printf("Launching Kernel\n");
#ifdef USE_SIMPLE
        simple_kernel<<<128,128>>>(n,d1);
#else
        kernel<<<128,128>>>(e1+1,d1,b1);
#endif
        cudaMemcpy(darray,d1,n * sizeof(double),cudaMemcpyDeviceToHost);
        printf("\nKernel call completed\n");
        cudaFree(d1);
}


% nvcc -c kernel.cu ; nvfortran -Mcuda updated_fortran.f90 kernel.o -c++libs ; a.out
updated_fortran.f90:
 Enter the start of the array:
5
 Enter the end of the array:
15
 darray(            5 ) =     5.000000000000000
 darray(            6 ) =     6.000000000000000
 darray(            7 ) =     7.000000000000000
 darray(            8 ) =     8.000000000000000
 darray(            9 ) =     9.000000000000000
 darray(           10 ) =     10.00000000000000
 darray(           11 ) =     11.00000000000000
 darray(           12 ) =     12.00000000000000
 darray(           13 ) =     13.00000000000000
 darray(           14 ) =     14.00000000000000
 darray(           15 ) =     15.00000000000000
Enter GPU main: 15 5
///////////////////////////Memory allocated//////////////////////////////
Launching Kernel

Kernel call completed
 RET(            5 ) =    -5.000000000000000
 RET(            6 ) =    -6.000000000000000
 RET(            7 ) =    -7.000000000000000
 RET(            8 ) =    -8.000000000000000
 RET(            9 ) =    -9.000000000000000
 RET(           10 ) =    -10.00000000000000
 RET(           11 ) =    -11.00000000000000
 RET(           12 ) =    -12.00000000000000
 RET(           13 ) =    -13.00000000000000
 RET(           14 ) =    -14.00000000000000
 RET(           15 ) =    -15.00000000000000

% nvcc -c kernel.cu -DUSE_SIMPLE ; nvfortran -Mcuda updated_fortran.f90 kernel.o -c++libs ; a.out
updated_fortran.f90:
 Enter the start of the array:
2
 Enter the end of the array:
17
 darray(            2 ) =     2.000000000000000
 darray(            3 ) =     3.000000000000000
 darray(            4 ) =     4.000000000000000
 darray(            5 ) =     5.000000000000000
 darray(            6 ) =     6.000000000000000
 darray(            7 ) =     7.000000000000000
 darray(            8 ) =     8.000000000000000
 darray(            9 ) =     9.000000000000000
 darray(           10 ) =     10.00000000000000
 darray(           11 ) =     11.00000000000000
 darray(           12 ) =     12.00000000000000
 darray(           13 ) =     13.00000000000000
 darray(           14 ) =     14.00000000000000
 darray(           15 ) =     15.00000000000000
 darray(           16 ) =     16.00000000000000
 darray(           17 ) =     17.00000000000000
Enter GPU main: 17 2
///////////////////////////Memory allocated//////////////////////////////
Launching Kernel

Kernel call completed
 RET(            2 ) =    -2.000000000000000
 RET(            3 ) =    -3.000000000000000
 RET(            4 ) =    -4.000000000000000
 RET(            5 ) =    -5.000000000000000
 RET(            6 ) =    -6.000000000000000
 RET(            7 ) =    -7.000000000000000
 RET(            8 ) =    -8.000000000000000
 RET(            9 ) =    -9.000000000000000
 RET(           10 ) =    -10.00000000000000
 RET(           11 ) =    -11.00000000000000
 RET(           12 ) =    -12.00000000000000
 RET(           13 ) =    -13.00000000000000
 RET(           14 ) =    -14.00000000000000
 RET(           15 ) =    -15.00000000000000
 RET(           16 ) =    -16.00000000000000
 RET(           17 ) =    -17.00000000000000

% a.out
 Enter the start of the array:
-10
 Enter the end of the array:
2
 darray(          -10 ) =    -10.00000000000000
 darray(           -9 ) =    -9.000000000000000
 darray(           -8 ) =    -8.000000000000000
 darray(           -7 ) =    -7.000000000000000
 darray(           -6 ) =    -6.000000000000000
 darray(           -5 ) =    -5.000000000000000
 darray(           -4 ) =    -4.000000000000000
 darray(           -3 ) =    -3.000000000000000
 darray(           -2 ) =    -2.000000000000000
 darray(           -1 ) =    -1.000000000000000
 darray(            0 ) =     0.000000000000000
 darray(            1 ) =     1.000000000000000
 darray(            2 ) =     2.000000000000000
Enter GPU main: 2 -10
///////////////////////////Memory allocated//////////////////////////////
Launching Kernel

Kernel call completed
 RET(          -10 ) =     10.00000000000000
 RET(           -9 ) =     9.000000000000000
 RET(           -8 ) =     8.000000000000000
 RET(           -7 ) =     7.000000000000000
 RET(           -6 ) =     6.000000000000000
 RET(           -5 ) =     5.000000000000000
 RET(           -4 ) =     4.000000000000000
 RET(           -3 ) =     3.000000000000000
 RET(           -2 ) =     2.000000000000000
 RET(           -1 ) =     1.000000000000000
 RET(            0 ) =    -0.000000000000000
 RET(            1 ) =    -1.000000000000000
 RET(            2 ) =    -2.000000000000000