OpenACC and PGI

Hi everyone,
I just discovered OpenACC and I am quite impressed by what it can achieve in terms of accelerating programs.
I want to delve deeper into the details of OpenACC and its internals, but I can't find any resource explaining how OpenACC really works. There are just a bunch of tutorials, if I may say so, and some API specifications I found on the net, but nothing else.
By internals I mean: how does the compiler work when we pass a flag such as -ta=nvidia or -Mcuda? What is generated under the hood? Is OpenACC in this case using the CUDA runtime library or the device library? Is everything dynamically linked, or are some libraries linked statically? Does the generated executable (after compiling) contain code optimized for the targeted accelerator?
Can someone please provide some answers to my questions or direct me to some useful resources? I am not interested in how to program in OpenACC; the web already contains plenty of good practical courses, tutorials and explanations.
Thank you

Hi youness,

There are too many details to cover here in the forum, but I'll do my best to give some of the basics.

By internals I mean: how does the compiler work when we pass a flag such as -ta=nvidia or -Mcuda? What is generated under the hood?

When targeting Tesla devices, by default the compiler generates LLVM code which is then passed to the NVIDIA NVVM back-end compiler to produce the device object code. When using RDC (relocatable device code, the default), nvlink is invoked at link time to create a device binary that gets embedded in the final executable.

Without RDC (-ta=tesla:nordc or -Mcuda=nordc), a binary as well as PTX is created and embedded in the object file itself. At link time, each device binary is embedded in the final executable.
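
To make the rest of this concrete, here is a minimal OpenACC example I'll refer to below. It's only an illustrative sketch of mine (the file name, routine, and exact compile lines are my own, so adapt them to your installation):

    /* saxpy.c -- minimal OpenACC example (illustrative; names are mine) */
    #include <stdio.h>
    #include <stdlib.h>

    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        /* The compiler outlines this loop into a device kernel and generates
           the host-side launch and data-movement code around it. */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy(n, 2.0f, x, y);
        printf("y[0] = %f\n", y[0]);   /* expect 4.000000 */
        free(x); free(y);
        return 0;
    }

    /* Typical builds for the two paths described above:
         pgcc -acc -ta=tesla       saxpy.c -o saxpy    (default: LLVM/NVVM device code, RDC, nvlink at link time)
         pgcc -acc -ta=tesla:nordc saxpy.c -o saxpy    (no RDC: device binary plus PTX embedded in the object file itself)  */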

We also have an older method where we do a source-to-source translation to very low-level CUDA code. We mostly use this for debugging, since the LLVM code generation is akin to assembly code. If you want, you can view the generated CUDA code for OpenACC via “-ta=tesla:nollvm,keep”. The code is output to a “.gpu” file. You can also keep the generated LLVM device code, but it is more difficult to read.
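
To give a feel for what that translation looks like, here is roughly the shape of kernel you'd find in the “.gpu” file for a loop like the saxpy above. This is a hand-written sketch of the idea, not actual PGI output; the real generated code uses compiler-internal names and is considerably more verbose:

    /* Hand-written sketch of the kind of kernel the ".gpu" file contains; not real compiler output. */
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* CUDA built-in index variables */
        if (i < n)
            y[i] = a * x[i] + y[i];
    }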

Is OpenACC in this case using the CUDA runtime library or the device library?

Runtime.

Is everything dynamically linked, or are some libraries linked statically?

By default, the binary will be dynamically linked. You can use the flag “-Bstatic_pgi” to have the PGI runtime linked statically. There’s also “-Bstatic” to have all libraries linked statically, but this may fail since not all system or CUDA libraries have static versions available.
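
For example, you can see the difference with ldd after building the saxpy sketch above (the exact list of shared libraries will vary between installations and compiler versions):

    pgcc -acc -ta=tesla saxpy.c -o saxpy
    ldd ./saxpy        (PGI and CUDA runtime libraries appear as shared dependencies)

    pgcc -acc -ta=tesla -Bstatic_pgi saxpy.c -o saxpy
    ldd ./saxpy        (the PGI runtime libraries are now folded into the executable)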

Does the generated executable (after compiling) contain code optimized for the targeted accelerator?

Yes. The compiler will optimize the code for a given target accelerator.
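
For instance, you can steer the device code generation toward a particular GPU generation with the tesla sub-options (check the PGI documentation for the sub-options your release supports; cc60/cc70 below are just examples):

    pgcc -acc -ta=tesla:cc70 saxpy.c -o saxpy         (tune for compute capability 7.0 devices)
    pgcc -acc -ta=tesla:cc60,cc70 saxpy.c -o saxpy    (embed code for more than one compute capability)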

-Mat

Hi Mat,
Thank you for your answer, but I am still a bit confused. It would be great if you could help. I ran ldd on the executable after compiling with -ta=tesla -Mcuda and found libcudart.so and libcudadevice.so loaded. So at first I thought they were both used while the program is running. But when I ran the program under nvprof I didn't see cuda* calls but rather cu* calls, which suggests the device library is the only one actually used. Am I correct or not? If so, the library used by OpenACC would be the device library, not the runtime.
And in the .gpu file you mentioned, which contains the CUDA code, there is the CUDA runtime header and the threadIdx and blockDim built-in variables, but no CUDA calls (cudaMalloc, …).
Thanks
-Younes

If so, the library used by OpenACC would be the device library, not the runtime

My apologies, I got it backwards. OpenACC (i.e. -ta=tesla) will use libcudadevice, while -Mcuda uses libcudadevrt. So using both flags will use both CUDA runtime libraries.

And in the .gpu file you mentioned, which contains the CUDA code, there is the CUDA runtime header and the threadIdx and blockDim built-in variables, but no CUDA calls (cudaMalloc, …)

Correct, since this is only the CUDA kernel. Device memory allocation is a host-side operation, so these calls are part of the host-side code generation. For these, you'll need to review the host assembly (-Mkeepasm), though you'll want to look for calls to the OpenACC runtime library rather than direct calls to cudaMalloc.
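
For example, building the saxpy sketch from earlier (the runtime entry-point names are internal and vary by compiler version, so treat the grep as a rough filter):

    pgcc -acc -ta=tesla -Mkeepasm -c saxpy.c    (leaves the host assembly in saxpy.s)
    grep call saxpy.s                           (allocations, data transfers and kernel launches show up as calls into the OpenACC runtime)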

Another thing you may find useful is to set the environment variable “PGI_ACC_DEBUG=1”. This will have the OpenACC runtime emit debugging information about every call it makes, including device memory allocations and data movement.
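
For example, with the executable from the earlier sketch:

    PGI_ACC_DEBUG=1 ./saxpy

You should see a trace of the runtime calls (allocations, uploads, downloads, kernel launches) as the program runs.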

-Mat