Calling CUDA library functions in an OpenACC parallel region

Is it possible to call functions from a CUDA library inside an OpenACC parallelized loop?

I have the following example code (file named test2d.c):

#include <iostream>
#include <npp.h>
#include <nppi.h>
#include <cuda_runtime.h>
#include <cstdlib>

#pragma acc routine
extern NppStatus nppiMulC_32f_C1R (const Npp32f *pSrc1, int nSrc1Step, const Npp32f nConstant, Npp32f *pDst, int nDstStep, NppiSize oSizeROI);

#pragma acc routine
extern NppStatus nppiAddC_32f_C1R (const Npp32f *pSrc1, int nSrc1Step, const Npp32f nConstant, Npp32f *pDst, int nDstStep, NppiSize oSizeROI);

int main(int argc, char **argv)
{
  Npp32f *x, *y, *tmp;
  int n = 10;
  int stepX = 0;
  int stepY = 0;
  int stepTmp = 0;
  NppiSize fullSize = {n, n};
  NppiSize roiSize = {n, 1};

  float *res =(float*) malloc(n*n*sizeof(float));

  x = nppiMalloc_32f_C1(n, n, &stepX);
  y = nppiMalloc_32f_C1(n, n, &stepY);
  tmp = nppiMalloc_32f_C1(n, n, &stepTmp);

  nppiSet_32f_C1R(1.0, x, stepX, fullSize);
   
  #pragma acc data deviceptr(x, y, tmp) 
  {
    #pragma acc parallel loop independent
    for(int j=0; j<n; j++)
    {
        /* NPP step values are in bytes, so compute row pointers byte-wise */
        Npp32f *pSrc = (Npp32f *)((Npp8u *)x   + j*stepX);
        Npp32f *pTmp = (Npp32f *)((Npp8u *)tmp + j*stepTmp);
        Npp32f *pDst = (Npp32f *)((Npp8u *)y   + j*stepY);
        nppiMulC_32f_C1R(pSrc, stepX, 2.0, pTmp, stepTmp, roiSize);
        nppiAddC_32f_C1R(pTmp, stepTmp, 1.0, pDst, stepY, roiSize);
    }
  }
  
  /* y is pitched (stepY bytes per row), so copy it row by row */
  cudaMemcpy2D(res, n*sizeof(float), y, stepY, n*sizeof(float), n, cudaMemcpyDeviceToHost);

  for(int i = 0; i < n*n; i++)
  {
     std::cout << res[i] << std::endl;
  }

  nppiFree(x);
  nppiFree(y);
  nppiFree(tmp);
  return 0;
}

These are the compilation steps:

pgc++ -c -acc -Minfo=accel -ta:tesla:cc35 -ICUDA_HOME/include/ -ICUDA_HOME/samples/common/inc/ test2d.c

pgc++ test2d.o -acc -Minfo=accel -ta:tesla:cc35 -ICUDA_HOME/include/ -ICUDA_HOME/samples/common/inc/  -o test -LCUDA_HOME/lib64/ -L/opt/tools/cuda-8.0/lib64/  -lnppc -lnppial -lnppidei -lnppisu  -lcudart -Mcuda

It uses NVIDIA Performance Primitives (NPP) inside an OpenACC parallelized loop. If I remove the “parallel” directive, it compiles and runs fine with just the “data” directive. However, when the parallel directive is added, the device linker cannot resolve any of the NPP functions. I am using pgc++ version 18.4 and CUDA version 8. This is the compiler output:


pgc++ -c -acc -Minfo=accel -ta:tesla:cc35 -ICUDA_HOME/include/ -ICUDA_HOME/samples/common/inc/ test2d.c
main:
     34, Accelerator kernel generated
         Generating Tesla code
         35, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
     34, Generating implicit copy(roiSize)
pgc++ test2d.o -acc -Minfo=accel -ta:tesla:cc35 -ICUDA_HOME/include/ -ICUDA_HOME/samples/common/inc/  -o test -LCUDA_HOME/lib64/ -L/opt/tools/cuda-8.0/lib64/  -lnppc -lnppial -lnppidei -lnppisu  -lcudart -Mcuda 
nvlink error   : Undefined reference to 'nppiMulC_32f_C1R' in 'test2d.o'
nvlink error   : Undefined reference to 'nppiAddC_32f_C1R' in 'test2d.o'
pgacclnk: child process exit status 2: /opt/compilers/pgi/linux86-64/18.4/bin/pgnvd
make: *** [test] Error 2

Hi jcastro9999,

Is it possible to call functions from a CUDA library inside an OpenACC parallelized loop?

Yes, it is possible, but can be a bit tricky.

First, the CUDA routine being called needs to be device callable, i.e. declared with the “__device__” attribute.
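
For illustration, “device callable” means the routine is compiled by nvcc with the __device__ qualifier (the names below are made up, not NPP’s):

// Compiled with nvcc: only the first function can be called from inside device code.
__device__ float scale_pixel(float v, float c) { return v * c; }   // device callable
float scale_pixel_host(float v, float c)       { return v * c; }   // host only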

Next, you need to get the device symbol name. Since nvcc is a C++ compiler, the symbol name will be mangled. You may need to run “nm” on the library and search for the correct name. Of course, this assumes that the symbols haven’t been stripped.
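
For illustration only (hypothetical routine, and the exact mangled string depends on the ABI nvcc targets), the difference “nm” shows looks roughly like this:

// What "nm" might list for a device routine compiled by nvcc:
__device__ NppStatus myMulC(const Npp32f *pSrc, Npp32f c, Npp32f *pDst, int n);
//   C++ linkage -> mangled, e.g. something like _Z6myMulCPKffPfi
extern "C" __device__ NppStatus myMulC_c(const Npp32f *pSrc, Npp32f c, Npp32f *pDst, int n);
//   C linkage   -> unmangled: myMulC_c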

Next, use the “bind” clause on the “routine” directive so the call maps to the correct symbol name. Something like:

#pragma acc routine bind("mangled_symbol_name")

In cases like this, where the symbol name may be difficult to determine, I’ve written CUDA interfaces that call the library routine. That way I can grab the symbol name out of my own object file rather than hunting for it in the library (especially if the library is stripped).
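
As a sketch of that idea (the file and function names here are my own, and “libdev_mulC” stands in for whatever device-callable entry point the library actually provides, which needs to be verified):

// npp_wrappers.cu, compiled with something like: nvcc -rdc=true -arch=sm_35 -c npp_wrappers.cu
#include <npp.h>

// Hypothetical device-callable library routine; replace with the real one, if it exists.
extern "C" __device__ NppStatus libdev_mulC(const Npp32f *pSrc, int nSrcStep, Npp32f nConstant,
                                            Npp32f *pDst, int nDstStep, NppiSize oSizeROI);

// Thin extern "C" wrapper: its symbol stays unmangled, so it is easy to find
// with "nm npp_wrappers.o" and to name in an OpenACC bind clause.
extern "C" __device__
NppStatus acc_mulC(const Npp32f *pSrc, int nSrcStep, Npp32f nConstant,
                   Npp32f *pDst, int nDstStep, NppiSize oSizeROI)
{
    return libdev_mulC(pSrc, nSrcStep, nConstant, pDst, nDstStep, oSizeROI);
}

Then, in the OpenACC source, bind the prototype to the wrapper’s symbol:

#pragma acc routine seq bind("acc_mulC")
NppStatus acc_mulC(const Npp32f *pSrc, int nSrcStep, Npp32f nConstant,
                   Npp32f *pDst, int nDstStep, NppiSize oSizeROI);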

Also, be sure to compile and link with “-Mcuda” (which you do) so the compiler knows you’re linking with CUDA code. We set things up slightly differently for OpenACC when mixing in CUDA.

Finally, on occasion you may need to fall back to our CUDA code generator (-ta=tesla:nollvm) rather than the default LLVM code generator.
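
Putting the flags together, the build might then look something like this (the nvcc-compiled wrapper object name is hypothetical; the other paths, flags, and libraries are taken from your commands above):

pgc++ -c -acc -Minfo=accel -ta=tesla:cc35,nollvm -Mcuda test2d.c
pgc++ test2d.o npp_wrappers.o -acc -ta=tesla:cc35,nollvm -Mcuda -L/opt/tools/cuda-8.0/lib64/ -lnppc -lnppial -lnppidei -lnppisu -lcudart -o test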

Hope this helps,
Mat

Hi Mat,


I added the “__device__” attribute to both routines, “nppiAddC_32f_C1R” and “nppiMulC_32f_C1R”, in test2d.c, like this:

#pragma acc routine
extern "C" __device__ NppStatus nppiMulC_32f_C1R (const Npp32f *pSrc1, int nSrc1Step, const Npp32f nConstant, Npp32f *pDst, int nDstStep, NppiSize oSizeROI);
#pragma acc routine
extern "C" __device__ NppStatus nppiAddC_32f_C1R (const Npp32f *pSrc1, int nSrc1Step, const Npp32f nConstant, Npp32f *pDst, int nDstStep, NppiSize oSizeROI);

When I use “nm” to search for the mangled symbols of these functions in the generated object file (test2d.o), they do not appear at all. However, if I comment out the “acc parallel” directive, I do find the unmangled symbols of “nppiAddC_32f_C1R” and “nppiMulC_32f_C1R” in test2d.o, since both functions are declared with C linkage. Therefore I did not add the bind clause to the routine directives. I also used nm to search for the mangled symbols in the dynamic library (libnppial.so) that provides these functions, and the names there were not mangled either.

I also tried using -ta=tesla:nollvm without success (same problem). So, am I doing something wrong here, or is there something else I can try to solve this problem?

And finally, do you think that calling functions from CUDA-accelerated libraries inside an OpenACC parallelized loop is feasible, or even a good approach?

Regards,

Are you sure these routines are callable from device code? Do you have an example CUDA program you are following?

Hi jcastro9999,

Apologies if I was unclear. Here you’re trying to mix OpenACC and CUDA in the same file. What I’m suggesting is that in the C source, you add the “routine” directive with the bind name set to the mangled name of the function as it appears in the object file compiled with nvcc.

For example, something like the following in the OpenACC-enabled C source or header file:

#pragma acc routine bind("nppiMulC_32f_C1R")
NppStatus nppiMulC_32f_C1R (const Npp32f *pSrc1, int nSrc1Step, const Npp32f nConstant, Npp32f *pDst, int nDstStep, NppiSize oSizeROI);

#pragma acc routine bind("nppiAddC_32f_C1R")
NppStatus nppiAddC_32f_C1R (const Npp32f *pSrc1, int nSrc1Step, const Npp32f nConstant, Npp32f *pDst, int nDstStep, NppiSize oSizeROI);

My guess is that since the CUDA device function is in a library, its name is probably mangled. Hence you’ll want to use “nm” on the library to find the device routine names. If they are mangled, change the bind strings above to the appropriate mangled names so the linker can find them.
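
For example, something along these lines (the path and grep pattern are just an illustration; check whichever symbols you need):

nm -D /opt/tools/cuda-8.0/lib64/libnppial.so | grep nppiMulC

Whatever form shows up there (mangled names start with “_Z”) is the exact string to put in the bind clause.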

Hopefully the library hasn’t been stripped so you can find the mangled name. If it is stripped, then you’ll need to ask the authors what name to use.

And finally, do you think that calling functions from CUDA-accelerated libraries inside an OpenACC parallelized loop is feasible, or even a good approach?

It’s feasible and a fine approach, though there may be some performance loss when calling subroutines from device code.

-Mat