How to understand the arguments of the generated intermediate CUDA kernel code and map them to the OpenACC kernel in the PGI compiler

Code reference: https://www.olcf.ornl.gov/tutorials/openacc-vector-addition/
The vecAdd.f90 OpenACC code was compiled with the PGI compiler using the flags -acc -ta=tesla:cc70,nollvm,keep -Minfo=accel.

Accelerator kernel generated code (vecadd.f90):

__global__ __launch_bounds__(128) void main_29_gpu(
    double* p3 /* .Z0640 */,    --> a(1:n) in OpenACC
    double* p5 /* .Z0634 */,    --> b(1:n) in OpenACC
    double* p6 /* .Z0646 */,    --> result c(1:n) in OpenACC
    long long x1 /* z_b_7 */,   --> what is this argument and how does it map to OpenACC?
    long long x4 /* z_b_3 */,   --> what is this argument and how does it map to OpenACC?
    long long x8 /* z_b_11 */)  --> what is this argument and how does it map to OpenACC?
{
}

OpenACC parallel code:

!$acc kernels !copyin(a(1:n),b(1:n)), copyout(c(1:n))
   do i=1,n
      c(i) = a(i) + b(i)
      sum = sum + c(i)
   enddo
!$acc end kernels

I believe these are the starting offsets into the arrays, allowing for correct indexing if the lower bound of the array isn’t 1. Though I can ask a compiler engineer if you need a more definitive answer.
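To illustrate what I mean, here is a minimal hand-written CUDA sketch (not the code PGI actually emits; the names vec_add_gpu, lb_a, lb_b, lb_c, and n are made up for illustration) of how lower-bound arguments such as z_b_7/z_b_3/z_b_11 would typically be consumed:

/* Hypothetical sketch: how lower-bound offsets could be used inside a
   generated kernel.  All names here are invented for illustration. */
__global__ void vec_add_gpu(
    double* a, double* b, double* c,
    long long lb_a, long long lb_b, long long lb_c,   /* Fortran lower bounds */
    int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;   /* Fortran index starts at 1 */
    if (i <= n) {
        /* Subtracting the lower bound turns the Fortran subscript into a
           0-based C offset, so a(lb_a:ub) is addressed correctly. */
        c[i - lb_c] = a[i - lb_a] + b[i - lb_b];
    }
}

The exact arguments PGI passes may differ; the point is only that these long long parameters carry bound/offset information rather than array data.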


I would like a more definitive answer on this, as well as on the OpenACC pragma in the do loop. These points will help me better understand PGI and the accelerator kernel code.

Example:
! variables t_mass, t_theta_rdz, t_dt_theta_rdz are REAL(rstd), POINTER
DO ind=1,ndomain
!$acc parallel loop present(t_mass(:,:), t_theta_rdz(:,:,:), t_dt_theta_rdz(:,:)) async
  DO l = ll_begin, ll_end
!$acc loop
!DIR$ SIMD
    DO ij=ij_begin,ij_end
      t_dt_theta_rdz(ij,l) = t_theta_rdz(ij,l,1) / t_mass(ij,l)
    ENDDO
  ENDDO
!$acc end parallel loop
ENDDO

__global__ __launch_bounds__(128) void theta_rdh_gpu(
    int tc8,   --> ll_begin
    int tc7,   --> ll_end
    int n6,    --> ij_begin
    int n3,    --> ij_end
    signed char* p12 /* t_theta_rdz$p */,
    signed char* p15 /* t_mass$p */,
    signed char* p18 /* t_dt_theta_rdz$p */,
    /* Are the long long values below offsets/array indices of the do loops, or the sizes of the arrays (like t_theta_rdz$p)? */ --> ??
    long long x11 /* t_theta_rdz$sd */,
    long long x12 /* t_theta_rdz$sd */,
    long long x13 /* t_theta_rdz$sd */,
    long long x14 /* t_theta_rdz$sd */,
    long long x15 /* t_theta_rdz$sd */,
    long long x21 /* t_mass$sd */,
    long long x22 /* t_mass$sd */,
    long long x23 /* t_mass$sd */,
    long long x24 /* t_mass$sd */,
    long long x28 /* t_dt_theta_rdz$sd */,
    long long x29 /* t_dt_theta_rdz$sd */,
    long long x30 /* t_dt_theta_rdz$sd */,
    long long x31 /* t_dt_theta_rdz$sd */)

@MatColgrove, I would like to know whether there is any update/info from the compiler engineer on this.

His preliminary assessment was the same as mine: since these are allocated 1D arrays, these are the lower bounds of the arrays. He was going to double-check but hasn’t gotten back to me on that.

For the next code, since these are multidimensional arrays, the array descriptors are being passed in, with “$p” being the device pointers and “$sd” being the section descriptors.
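As a rough illustration only (my own sketch with assumed field meanings, not the actual PGI descriptor layout), the section-descriptor values essentially give the kernel the lower bounds and strides it needs to linearize a multidimensional Fortran subscript against the flat device pointer from “$p”:

/* Hypothetical sketch: using lower bounds and a stride (the kind of values
   carried in t_mass$sd etc.) to index a 2-D Fortran array t_mass(ij,l)
   through its flat device pointer t_mass$p.  Names and layout are
   assumptions for illustration, not the real descriptor format. */
__global__ void scale_t_mass_gpu(
    double* t_mass_p,                  /* t_mass$p: raw device pointer       */
    long long lb1, long long lb2,      /* lower bounds of dimensions 1 and 2 */
    long long stride2,                 /* elements per step in dimension 2   */
    int ij_begin, int ij_end, int l)
{
    int ij = ij_begin + (int)(blockIdx.x * blockDim.x + threadIdx.x);
    if (ij <= ij_end) {
        /* Column-major linearization of element (ij, l) */
        long long idx = (ij - lb1) + (l - lb2) * stride2;
        t_mass_p[idx] = t_mass_p[idx] * 2.0;
    }
}

Presumably one “$sd” value is passed per bound/stride of each array, which would explain why several long long arguments appear per array above.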

I should note that we didn’t carry forward support for the CUDA C based device code generator in the NV HPC Compilers. We’re not going to explicitly disable it, but we won’t be maintaining it. I’m unclear on the basis for your questions, but if you’re using this output to learn how to translate your code to CUDA C, then this really isn’t the best method, since the generated code is too low level. If you’re writing your own OpenACC compiler, then you should be using the LLVM device code generator (libnvvm.so), since that will allow for a much broader range of code.

Thanks for the clarification, it helps me understand PGI; I will definitely try the LLVM device code generator to better understand OpenACC.
However, I don’t understand the terms below, which the PGI compiler uses in the intermediate kernel code it generates for the OpenACC pragma/loop:
1. __pgi_drcp
2. pgf90_alloc04a_i8
3. pgf90_dealloc03a_i

These are calls to the compiler runtime. “drcp” performs a double precision floating point reciprocal operation. “alloc” is allocating memory, in this case an integer*8. “dealloc” is deallocating this memory.
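To make the reciprocal part concrete, here is a hedged sketch of the lowering involved; my_drcp is a stand-in with an assumed signature, not the real __pgi_drcp from the compiler runtime:

/* Stand-in helper with an assumed signature; the real __pgi_drcp is part of
   the PGI compiler runtime and its exact interface is not shown here. */
__device__ static double my_drcp(double x)
{
    return 1.0 / x;          /* a tuned version might use __drcp_rn(x) */
}

/* A division such as t_theta_rdz(ij,l,1) / t_mass(ij,l) can be lowered to
   "numerator * reciprocal(denominator)". */
__device__ double divide_via_reciprocal(double num, double den)
{
    return num * my_drcp(den);
}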

Hi @MatColgrove,
In the above code, what are the tc8, tc7, n6, and n3 arguments, and how should they be mapped to the do loops?

While I can’t be sure since the example above is incomplete, these variables do correspond to the upper and lower bounds of the two loops. Since the body of the generated code is incomplete, I’m not sure exactly how they’re being used, but most likely they’ll be used for bounds checking by the individual threads within the kernel.
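For illustration, a speculative sketch (mine, not the actual generated body) of how such bound arguments typically show up as guards inside the kernel:

/* Speculative sketch: arguments like tc8/tc7 (ll_begin/ll_end) and n6/n3
   (ij_begin/ij_end) usually end up in bounds checks so that threads whose
   computed indices fall outside the loop iteration space do nothing. */
__global__ void bounds_check_sketch(
    int ll_begin, int ll_end, int ij_begin, int ij_end)
{
    int l  = ll_begin + (int)blockIdx.y;                                /* outer DO l  loop */
    int ij = ij_begin + (int)(blockIdx.x * blockDim.x + threadIdx.x);   /* inner DO ij loop */
    if (l > ll_end || ij > ij_end)
        return;   /* this thread has no iteration to execute */
    /* ... loop body would go here ... */
}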