For example, m=256.
The std::sort line works correctly on the CPU, but I need to make it work on the GPU.
The data is already allocated on GPU using
#pragma acc data copyin(bodies)
My OS is Debian and I use the following compile line: cmake . -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_C_FLAGS="-acc=gpu -Minfo=acc -mcmodel=medium -tp=haswell -Minline -cuda -gpu=cuda11.0,cc70" -DCMAKE_CXX_FLAGS="-acc=gpu -Minfo=acc -mcmodel=medium -tp=haswell -Minline -cuda -gpu=cuda11.0,cc70" -DCMAKE_CXX_STANDARD=17
Is there any way to use any CUDA (maybe cublas) sort algorithm to sort bodies on GPU?
If there is a way, could you show how to write the GPU code that would replace std::sort on the CPU, and which header files to include (perhaps a library needs to be installed)?
Could you also point me to any guides on installing and using CUDA libraries with OpenACC?
Thank You.
Have you looked at using C++ standard parallelism (stdpar)? Sort can be parallelized when given a parallel execution policy. Our implementation of stdpar will offload to the GPU via the flag “-stdpar=gpu” and is interoperable with OpenACC, so it can be added to your existing OpenACC code.
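As a rough illustration of the stdpar suggestion, here is a minimal sketch of a sort called with a parallel execution policy. The Body struct, its key field, the helper names and the USE_PAR_POLICY macro are all hypothetical stand-ins, since the original element type and comparison are not shown in the thread. The macro only exists so the same file also builds as a plain serial sort with compilers that lack a parallel backend.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#ifdef USE_PAR_POLICY            // hypothetical macro, e.g. set with -DUSE_PAR_POLICY
#include <execution>
#endif

// Hypothetical stand-in for the element type of "bodies".
struct Body {
    double x, y, z;
    unsigned long key;           // assumed sort key, not taken from the original code
};

// Sort bodies by ascending key. With USE_PAR_POLICY defined, the call passes
// std::execution::par, which is the form a stdpar-capable compiler can offload.
inline void sort_bodies(std::vector<Body>& bodies) {
    auto by_key = [](const Body& a, const Body& b) { return a.key < b.key; };
#ifdef USE_PAR_POLICY
    std::sort(std::execution::par, bodies.begin(), bodies.end(), by_key);
#else
    std::sort(bodies.begin(), bodies.end(), by_key);
#endif
}

// Build m bodies whose keys run from m down to 1, so sorting has work to do.
inline std::vector<Body> make_bodies(std::size_t m) {
    std::vector<Body> bodies(m);
    for (std::size_t i = 0; i < m; ++i) {
        bodies[i] = Body{0.0, 0.0, 0.0, static_cast<unsigned long>(m - i)};
    }
    return bodies;
}
```

With nvc++, this would presumably be built by adding -stdpar=gpu and -DUSE_PAR_POLICY to the existing flags. The sketch uses std::vector because the stdpar GPU path relies on heap-allocated data being accessible from the device; whether that fits the way bodies is allocated in the original code is an open question.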
How to tell the compiler that bodies is a pointer to the data on the device here?
Could you tell me whether there is a workaround?
Maybe, it is possible to use CUBLAS sort methods in OpenAcc code compiled by nvc++?
Thank You very much.
Assuming “m” is a variable, the bodies array is a VLA, so it is implicitly dynamically allocated. Did you try using sort with an execution policy, and did it not work?
How to tell the compiler that bodies is a pointer to the data on the device here?
Maybe I’m missing something, but it should just work as is, assuming the loop is within the data region. Granted, you should add the array shape to bodies in the data clause so that the entire array is copied to the device, rather than just the pointer. Also, I’ll usually add a “present” clause on the parallel loop.
#pragma acc data copyin(bodies[0:m])
{
  .....
  #pragma acc parallel loop gang vector present(bodies)
  for(int i=0; i<n; ++i)
  {
    bodies[i]+=Point3D(1.0e-3,1.0e-3,1.0e-3);
  }
.
If you’re encountering an error, please post the error you’re seeing, and if possible, a minimal reproducing example.
Maybe, it is possible to use CUBLAS sort methods in OpenAcc code compiled by nvc++?
As far as I’m aware, cuBLAS doesn’t have a sort. Are you thinking of Thrust? Then sure, but our C++ stdpar implementation is built on top of Thrust, so using the stdpar sort is basically the same thing.