Offload a complicated class to the GPU with OpenACC

Hello,

I am trying to do a geometrical search on the GPU by offloading the classes to the device. The classes implement a k-dimensional tree, whose purpose is to find, for a given point, the closest point in a point cloud very quickly.

The main code (attached below, along with the mini app) looks quite simple. It builds the tree and performs the search for the provided points. Each search is independent of the others, which is why I would like to run them on the device.

All the functions in the classes can execute sequentially on the device. There is a lot happening behind the scenes of kdt->build and kdt->nearest, so I am wondering whether it is possible to offload this class (and the classes related to it) to the device.

I have made some attempts, but they were not very successful. Could someone shed some light on this, please?

Thanks for your help in advance,
Feng

# include <iostream>
# include <fstream>
# include <kdtree.h>

using namespace std;

   int main()
   {
       Real *x[3], d, x0[3];
       Int ip, np, it, id, nt;
       cKdTree *kdt;
       ifstream fle;

       np = 15;
       x[0] = new Real[np];
       x[1] = new Real[np];
       x[2] = new Real[np];

       fle.open("check.dat");
       for(ip=0; ip<np; ip++)
       {
           fle >> x[0][ip] >> x[1][ip] >> x[2][ip];
       }
       fle.close();

       kdt = new cKdTree();
       kdt->build( 3, np, x );

       cout << "done with building the tree\n";

       nt = 1000;
       #pragma acc parallel loop \
        private(x0) \
        present(kdt[:1]) \
        default(none)
       for(it=0; it<nt; it++)
       {
           // in the real case, x0 will be different for each "it" and
           // each search is independent; this is just for the test
           x0[0] = 10.;
           x0[1] = 0.0533958;
           x0[2] = 9.9;
           kdt->nearest( x0, &id, &d );
           //cout << it << " " << id << " " << d << "\n";
       }

       #pragma acc exit data copyout(kdt[:1])
       delete[] x[0];
       delete[] x[1];
       delete[] x[2];
       delete kdt;

       return 0;
   }

The mini application is:
check_kdtree.tar (40 KB)

Hi Feng,

In this case, I think you’re best off using CUDA Unified Memory (i.e. add “-gpu=managed” to your compile flags).

You can use data directives, but you would need to do a manual deep copy of the data using “enter data” directives and possibly some “attach” clauses for the child boxes. Given the complexity of the structure, it can be a bit difficult to get right.

If the code only copies the structure to the device, does all the compute on the device, and then copies the data back, there’s no performance advantage to a manual deep copy versus using UM. So again, you’ll be better off using UM in this case.
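For illustration, a compile line for the mini app under this approach might look like the following (the compiler invocation and file names are assumptions, not taken from the attached makefile):

```shell
# With -gpu=managed, all dynamic allocations become CUDA managed memory,
# so the kd-tree's internal pointers are valid on both host and device.
nvc++ -acc -gpu=managed -Minfo=accel main.cpp kdtree.cpp -o check_kdtree
```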

-Mat

Hi Mat,

Thanks for your reply!

This mini app resembles one part of my full application. In the full application I don’t use Unified Memory, so I might not be able to use it in this case.

Besides, is it possible for me to construct this class (i.e. kdt = new cKdTree();) and build it (i.e. kdt->build( 3, np, x );) on the host, then copy this kdt object to the device and call “kdt->nearest” on the device?

Thanks,
Feng

It should be possible, but it’s a bit tricky. If you can use UM for the full app, I’d start there, since it will make things very easy.

Alternately, you can do the allocation for cKdTree and its subcomponents using “cudaMallocManaged” so only those are managed. Then remove “-gpu=managed” from the compilation but keep it on the link line. Also add “-cuda” to both.
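As a sketch of this suggestion (the CUDA calls are the real ones, but the class is a stand-in for cKdTree, and std::malloc substitutes for cudaMallocManaged when not compiling under CUDA so the pattern stays self-contained): allocate raw managed storage, then construct the object into it with placement new.

```cpp
#include <cstdlib>
#include <new>

// In the real build these wrappers call cudaMallocManaged/cudaFree so the
// object lives in memory visible to both host and device; std::malloc and
// std::free stand in when CUDA isn't available.
inline void* managed_alloc(std::size_t n)
{
#ifdef __CUDACC__
    void* p = nullptr;
    cudaMallocManaged(&p, n);
    return p;
#else
    return std::malloc(n);   // host-only stand-in for cudaMallocManaged
#endif
}

inline void managed_free(void* p)
{
#ifdef __CUDACC__
    cudaFree(p);
#else
    std::free(p);
#endif
}

// Hypothetical stand-in for cKdTree, just to show construction and teardown.
struct Tree
{
    int ndim;
    Tree() : ndim(0) {}
    void build(int nd) { ndim = nd; }
};

inline Tree* make_managed_tree(int ndim)
{
    // Replace "new Tree()" with managed allocation plus placement new.
    Tree* t = new (managed_alloc(sizeof(Tree))) Tree();
    t->build(ndim);
    return t;
}

inline void destroy_managed_tree(Tree* t)
{
    t->~Tree();           // explicit destructor call pairs with placement new
    managed_free(t);
}
```

Note that every `new` inside the class’s own build code (child nodes, coordinate arrays, etc.) has to be converted the same way, otherwise the device would dereference host-only pointers.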

If neither work for what you need, then we can dive into porting the class to do a manual deep copy. I’m a bit swamped right now, but will help as best I can.

I do have an example of a generic container class from my chapter in Parallel Programming with OpenACC that might be helpful. It’s much simpler than yours, but it may give you some ideas. See: ParallelProgrammingWithOpenACC/accList.h at master · rmfarber/ParallelProgrammingWithOpenACC · GitHub

Look for the “accList” source for the use cases in ParallelProgrammingWithOpenACC/Chapter05 at master · rmfarber/ParallelProgrammingWithOpenACC · GitHub

-Mat

Hi Mat,

Alternately, you can do the allocation for cKdTree and it’s subcomponents using “cudaMallocManaged” so only those are managed.

I think this works for me! I replaced the new with cudaMallocManaged in my mini app and I am getting the expected results. This looks a bit like magic to me, though; could you please elaborate on what’s happening behind the scenes, e.g. how does it deal with these classes in this case?

Then remove “-gpu=managed” from the compilation but keep it on the link line. Also add “-cuda” to both.

“but keep it on the link line”: what does this mean? I only added “-cuda” to the compile options.

I will try my full application and come back if I have more issues.

Thanks!
Feng

like how does it deal with these classes in this case?

UM creates a memory pool with virtual addresses that can be mapped to both the host and the device. The CUDA driver will then implicitly copy the data to/from the device when it is accessed (by page). Currently UM is only available for dynamically allocated data. For a more complete intro see: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

“but keep it on the link line”: what does this mean? I only added “-cuda” to the compile options.

The runtime needs to know that managed memory is being used, so at link time we add an object which sets this during initialization of the binary. Though on second thought, the “-cuda” flag will also do this, so it should be sufficient.

-Mat

Thanks Mat, I will have a look at that.

Hi Mat,

I have tried it in my full application and I get the following complaints from the compiler. Could you please provide some advice?

I have added the compiler complaints after the specific lines. These lines are inside a “#pragma acc parallel loop gang vector” construct. I have also specified “nearest” as “#pragma

         kdt->nearest( x0,&it,&d ); //ptxas fatal   : Unresolved extern function '_ZN7cKdTree7nearestEPdPiS0_'
         iq = idt[it]; //Scalar last value needed after loop for it at line 414, 414 is the line number in the original file
         is = isec[it];

Many thanks,
Feng

Below is a snippet of the function. The full application is quite big; if you need more information about the code, I am happy to provide it.
slide.cpp (4.6 KB)

Hi Feng,

How are you linking? Are you using nvcc? Is the OpenACC code in a shared library?

My initial thought is that it’s a relocatable device code (RDC) issue. RDC enables device code linking, but it isn’t enabled by default with nvcc and isn’t supported in OpenACC C++ shared objects. For nvcc you can try setting “--rdc=true”, but with a shared object you’ll want to try to inline “nearest” so no call is necessary. Since the inlining needs to be done across source files, this means a two-pass compilation: first with “-Mextract=lib:iLibname” to extract the inlining info, then “-Minline=lib:iLibname” to use it.

-Mat

Hi Mat,

I am compiling with the “mpicxx” included in the hpc_sdk, and I have the “-shared” flag in the link options. Below is a fraction of the makefile which generates one of the “.so” files:

       COPT  = -g -acc -Minfo=accel -gpu=cc61,nordc -Mcuda $(DEFS) -fPIC -Wall
       PCCMP = mpicxx
       BINS  = ../liborg.so

       $(BINS): $(OBJS)
               $(PCCMP) -shared $(OBJS) -o $@

       .cpp.o:
               $(PCCMP) $(COPT) $(ORGI) $(ENGI) $(SYSI) $(PETSCI) -o $@ -c $<

The codes are linked with each other dynamically. The cKdTree class and the cSlideFbndry class are compiled to the same “liborg.so” file.

Thanks,
Feng

“nordc” is the problem. In order to call device routines in separate files, the device code needs to be linked, but with “nordc” no link is performed. This, along with access to external global variables, is one of the two main restrictions of using “nordc”.

A couple options to try:

  1. Merge the sources so the definition of “nearest” (as well as the routines it calls) is visible at compilation. Moving “nearest”’s definition into the header file might work as well. You may also want to add the “inline” keyword to the signatures; the compiler will then attempt to inline the routines.

  2. Compile the sources together on the same compile line with your other options plus “-Minline” (i.e. nvc++ -g -acc -Minfo=accel,inline -gpu=cc61,nordc -Mcuda $(DEFS) -fPIC -Wall -Minline kdtree.cpp main.cpp).

  3. Do the two pass inlining using -Mextract (pass 1) then -Minline (pass 2). -Mextract first creates an inline library which can be used by -Minline to inline the routines. See: HPC Compiler Reference Manual Version 22.3 for ARM, OpenPower, x86
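The two-pass sequence in option 3 might look like this (the file and library names are placeholders; the exact flags are in the reference manual linked above):

```shell
# Pass 1: extract inlining information for the routines defined in kdtree.cpp
nvc++ -Mextract=lib:ilib -c kdtree.cpp
# Pass 2: recompile the caller, pulling routine bodies from the inline library
nvc++ -acc -gpu=cc61,nordc -Minfo=accel,inline -Minline=lib:ilib -c slide.cpp
```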

I can’t guarantee the compiler will be able to inline the routine (check the compiler feedback messages from -Minfo=inline to verify). If not, as a last resort you’ll need to manually inline the routines.
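A sketch of what option 1 can look like, with hypothetical signatures (the real “nearest” presumably takes the query point and returns the nearest index and distance): the definition moves into the header, marked “inline”, with an “acc routine seq” directive so a sequential device version can be generated if the call isn’t inlined.

```cpp
// kdtree.h (sketch) -- the full definition lives in the header so it is
// visible at every call site; "inline" lets the compiler fold the body
// directly into the OpenACC compute region.
#include <cmath>

class cKdTree
{
public:
    #pragma acc routine seq
    inline void nearest(const double* x0, int* id, double* d) const
    {
        // placeholder body: the real implementation walks the kd-tree
        *id = 0;
        *d  = std::sqrt(x0[0]*x0[0] + x0[1]*x0[1] + x0[2]*x0[2]);
    }
};
```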

-Mat

Hi Mat,

Thanks for your prompt reply!

I have tried option 1 and the code compiles now! I will see if I manage to let the code run and produce the expected results.

Another thing I find dodgy is this function:

         coo->toffset( 0, 1, ptch*is, tmpsq, 1 );

This function is inside the same “#pragma acc parallel loop gang vector”, yet the compiler does not complain about it, even though the class of “coo” is defined in a different “.so” file.

Besides, for this compiler output:

       Scalar last value needed after loop for it at line XXX

Should I worry about this? If yes, how could I fix this?

Thanks,
Feng

Hmm, even with Fortran- and C-based shared objects, where RDC is supported, device calls can only be made to device routines within the shared object itself. There’s no dynamic linker for device code (at least not yet), so I have no idea why this is working.

Are you sure “toffset”'s definition isn’t visible during compilation?

Maybe dead-code elimination is removing the call, or the OpenACC compute region isn’t getting offloaded?

Hi Mat,

I will double-check that. This is a bit weird to me.

         kdt->nearest( x0,&it,&d ); //ptxas fatal   : Unresolved extern function '_ZN7cKdTree7nearestEPdPiS0_', I think I know how to fix this
         iq = idt[it]; //Scalar last value needed after loop for it at line 414, 414 is the line number in the original file
         is = isec[it];

Should I be concerned about the compiler message “Scalar last value needed after loop for it at line 414”?

Thanks,
Feng

Sorry, I missed that you asked this earlier.

It may or may not be a concern, depending on whether it is inhibiting parallelization of the loop. The message means that the variable “it” may be used after the loop; hence, if the loop is parallelized, the value of the variable will be whatever the last-executing thread set, not the value from the last iteration.

In this case, it’s most likely caused by “it” being passed by address to “nearest”. While unlikely, the compiler must assume that the address could be captured by a global variable within the function.

The fix in this case is to add “it” to a private clause on the parallel loop.
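A minimal sketch of that fix, with a stand-in for kdt->nearest since the real class isn’t shown here (in this sketch the output index is called “id”; in the full app it was “it”): listing the output scalars in the private clause gives each iteration its own copy, so no last value has to survive the loop.

```cpp
#include <cmath>

// Hypothetical stand-in for kdt->nearest(): writes the nearest index and
// distance through pointers, which is what forces the scalars to be private.
#pragma acc routine seq
void nearest(const double* x0, int* id, double* d)
{
    *id = 0;
    *d  = std::sqrt(x0[0]*x0[0] + x0[1]*x0[1] + x0[2]*x0[2]);
}

// x0, id and d are per-iteration temporaries, so they go in the private
// clause; results land in per-iteration slots of the output arrays.
void run_searches(int nt, int* id_out, double* d_out)
{
    double x0[3];
    int id;
    double d;

    #pragma acc parallel loop private(x0, id, d)
    for (int it = 0; it < nt; it++)
    {
        x0[0] = 3.0; x0[1] = 4.0; x0[2] = 0.0;  // per-iteration query point
        nearest(x0, &id, &d);
        id_out[it] = id;
        d_out[it]  = d;
    }
}
```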

-Mat

Hi Mat,

Thanks for your reply. It seems to me that this “warning” does not cause any harm for the moment.

Thanks,
Feng

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.