Accelerating Standard C++ with GPUs Using stdpar

Originally published at: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/

Historically, accelerating your C++ code with GPUs has not been possible in Standard C++ without using language extensions or additional libraries: CUDA C++ requires the use of host and device attributes on functions and the triple-chevron syntax for GPU kernel launches. OpenACC uses #pragmas to control GPU acceleration. Thrust lets you express parallelism portably but uses language…

This was a great article! It appears that all the discussion and examples are based on accelerating standard C++ code without any need for CUDA programming, but only on a single GPU.

From my work so far on multi-GPU programming, using two GPUs and partitioning the data between them has always required some CUDA-related code: for instance, binding an MPI rank or a thread to one of the GPUs, or using CUDA streams to drive multiple GPUs simultaneously. These and other multi-GPU approaches all need to select a device one way or another, which requires CUDA.
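To illustrate the kind of CUDA-specific code being referred to, here is a minimal sketch of the common one-GPU-per-MPI-rank pattern (the round-robin rank-to-device mapping is an assumed convention, not something from the article):

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&num_devices);

    // The CUDA runtime call that stdpar alone cannot express:
    // pin this rank to one of the available GPUs.
    cudaSetDevice(rank % num_devices);

    // ... each rank now runs its stdpar algorithms on its own GPU,
    //     over its own partition of the data ...

    MPI_Finalize();
    return 0;
}
```

The point being made above is that this device-selection step sits outside Standard C++, which is what makes single-source multi-GPU stdpar code impossible today.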

All of these run in the opposite direction of “Accelerating Standard C++ with GPUs Using stdpar”, where the goal is to leave the CPU-based code unchanged (no CUDA runtime API calls, etc.) and simply compile it with NVC++. So I’m very curious whether there is any way around this currently, and if not, whether this is something to look forward to in the future? I’d appreciate any insights here.

Yes, multi-GPU stdpar support is on the roadmap.

Great to hear! Here in August 2021, are there any new developments on the NVC++ compiler using multiple GPUs?

Not yet, I’m afraid. Stay tuned!

I understand that the containers in use must allocate on the heap, not the stack, in order for unified memory to make the data visible to both CPU and GPU. My question is whether, or when, it will be possible for the containers’ memory to be an mmap pointer instead of a RAM pointer?

Hi, thanks for the question. We are working on enabling more memory types such as stack memory for use with the parallel algorithms. mmap memory is not on our near-term roadmap, but I have forwarded your inquiry on to the team.

I’ve created a benchmark for the Standard C++ Parallel STL functions. When compiling it with nvc++ and -stdpar, these functions run much slower than the serial (single-core CPU) versions, and sort() along with stable_sort() produces a segmentation fault (when sorting a vector of 100 million 32-bit integers). This is running on a Dell Alienware laptop with a GeForce RTX 3060 GPU and a 12th Gen Intel 14-core CPU.

Are there certain compiler switches that should be used to produce accelerated results for these functions? I currently use -stdpar and -O3 with nvc++.

When compiling (using nvc++) without -stdpar, all benchmarks, including sort() and stable_sort(), run to completion without a segmentation fault, executing on a single core of the Intel CPU.

Thank you,
-Victor
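For what it’s worth, -stdpar takes a target argument, so the same source can be built for the GPU or for all CPU cores. A compile-command sketch (the file and binary names are illustrative; -stdpar=gpu and -stdpar=multicore are the documented nvc++ variants, but check your toolchain’s documentation):

```shell
# Offload the parallel algorithms to the GPU (the default for -stdpar):
nvc++ -stdpar=gpu -O3 -o bench_gpu bench.cpp

# Run the same parallel algorithms across the CPU cores instead:
nvc++ -stdpar=multicore -O3 -o bench_cpu bench.cpp
```

Comparing the multicore build against the serial build can help separate “the parallel algorithm is slow” from “the GPU offload is slow”.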