Sort array on GPU using CUDA libraries and OpenACC

I have a C++ program that I need to run on the GPU using nvc++ and OpenACC. It already works correctly on the CPU. It contains this line:

std::sort(bodies, bodies+n, [](PointD a, PointD b){ return fabs(a.i) > fabs(b.i); });

Here

typedef struct {
  Point3D r;
  int i;
  uint64_t s;
} PointD;

Point3D is a structure denoting a 3D vector and

PointD bodies[m];

For example, m=256.
The std::sort line works correctly on the CPU, but I need it to run on the GPU.
The data is already copied to the GPU using

#pragma acc data copyin(bodies)

My OS is Debian and I configure the build with the following cmake line:
cmake . -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_C_FLAGS="-acc=gpu -Minfo=acc -mcmodel=medium -tp=haswell -Minline -cuda -gpu=cuda11.0,cc70" -DCMAKE_CXX_FLAGS="-acc=gpu -Minfo=acc -mcmodel=medium -tp=haswell -Minline -cuda -gpu=cuda11.0,cc70" -DCMAKE_CXX_STANDARD=17
Is there a way to use a CUDA library (maybe cuBLAS) sort algorithm to sort bodies on the GPU?
If so, could you show how to write the GPU replacement for the std::sort call, and which header files to include (and whether the library needs to be installed separately)?
Could you also point me to any guides on installing and using CUDA libraries with OpenACC?
Thank You.

Hi _and,

Have you looked at using C++ standard parallelism (stdpar)? Sort can be parallelized when given a parallel execution policy. Our implementation of stdpar offloads to the GPU via the flag “-stdpar=gpu” and is interoperable with OpenACC, so it can be added to your existing OpenACC code.

For details please see: Accelerating Standard C++ with GPUs Using stdpar | NVIDIA Developer Blog
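
As a rough illustration of the suggestion above, here is a minimal sketch of the sort with a parallel execution policy. The Point3D definition and the sort_bodies helper are hypothetical stand-ins, not taken from the original program, and storing the bodies in a std::vector (heap memory) is an assumption made for the example. std::abs replaces fabs because the field i is an int.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <execution>
#include <vector>

// Hypothetical stand-in for the Point3D type from the question.
struct Point3D { double x, y, z; };

struct PointD {
  Point3D r;
  int i;
  std::uint64_t s;
};

// Hypothetical helper: sorts by descending |i|, like the original
// std::sort call. std::execution::par permits parallel execution;
// when built with nvc++ and -stdpar=gpu the call may be offloaded.
void sort_bodies(std::vector<PointD>& bodies) {
  std::sort(std::execution::par, bodies.begin(), bodies.end(),
            [](const PointD& a, const PointD& b) {
              return std::abs(a.i) > std::abs(b.i);
            });
}
```

The same source also builds with an ordinary host compiler, in which case the sort simply runs on the CPU. Some standard libraries need an extra threading backend (for example TBB) to be linked for the parallel policies.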

-Mat

I allocate bodies statically, not dynamically. The article you linked says:

Currently, only data dynamically allocated on the heap in CPU code that was compiled by NVC++ can be managed automatically

I allocate the array on CPU as:

PointD bodies[m];

and copy it to GPU using:

#pragma acc data copyin(bodies)

I also don’t know how to deal with pointers to dynamically allocated data on the GPU. For example, I have a loop:

#pragma acc parallel loop gang vector
for(int i=0; i<n; ++i)
{
  bodies[i]+=Point3D(1.0e-3,1.0e-3,1.0e-3);
}

How to tell the compiler that bodies is a pointer to the data on the device here?
Is there any workaround?
Maybe, it is possible to use CUBLAS sort methods in OpenAcc code compiled by nvc++?
Thank You very much.

Assuming “m” is a variable, the bodies array is a VLA, so it is implicitly dynamically allocated. Did you try using sort with an execution policy, and did it fail?

How to tell the compiler that bodies is a pointer to the data on the device here?

Maybe I’m missing something, but it should just work as is, assuming the loop is within the data region. That said, you should add the array shape to bodies in the data clause so that the entire array is copied to the device, rather than just the pointer. Also, I’ll usually add a “present” clause on the parallel loop.

#pragma acc data copyin(bodies[0:m])
{
.....
#pragma acc parallel loop gang vector present(bodies)
for(int i=0; i<n; ++i)
{
     bodies[i]+=Point3D(1.0e-3,1.0e-3,1.0e-3);
}
.....
} // end of data region
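
For reference, a self-contained sketch of the same pattern follows. The Point3D and PointD definitions, the operator+= overload, and the shift_bodies function are hypothetical, written only so the snippet compiles on its own. It also uses copy rather than copyin, which is a deliberate change so the updated values come back to the host at the end of the data region.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the types in the question.
struct Point3D {
  double x, y, z;
  Point3D(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}
};

struct PointD {
  Point3D r{0.0, 0.0, 0.0};
  int i = 0;
  std::uint64_t s = 0;
  // Assumed overload: shift the position by a displacement vector.
  PointD& operator+=(const Point3D& d) {
    r.x += d.x; r.y += d.y; r.z += d.z;
    return *this;
  }
};

// Hypothetical function: shifts the first n bodies by a small offset.
void shift_bodies(PointD* bodies, int n) {
  // The array shape [0:n] makes the whole array available on the device.
  #pragma acc data copy(bodies[0:n])
  {
    // present() asserts that the data region has already placed bodies
    // on the device, so no implicit copy happens at the loop.
    #pragma acc parallel loop gang vector present(bodies)
    for (int i = 0; i < n; ++i) {
      bodies[i] += Point3D(1.0e-3, 1.0e-3, 1.0e-3);
    }
  }
}
```

Without OpenACC enabled the pragmas are ignored and the loop runs serially on the host, which is a convenient way to check the logic before offloading.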

If you’re encountering an error, please post the error you’re seeing, and if possible, a minimal reproducing example.

Maybe, it is possible to use CUBLAS sort methods in OpenAcc code compiled by nvc++?

As far as I’m aware, cuBLAS doesn’t have a sort. Are you thinking of Thrust? If so, then sure, but our C++ stdpar implementation is built on top of Thrust, so using the stdpar sort is basically the same thing.