Device code generated from -stdpar versus thrust

Hello, I would like to understand a bit more about any differences in the device code generated with -stdpar and thrust.

For example, if I compile the following code for the device with clang++, nvc++, or nvcc:

thrust::device_vector<int> D(10, 1);
int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

And then I compile the following code with nvc++ -stdpar=gpu:

std::vector<int> D(10, 1);
int sum = std::reduce(std::execution::par, D.begin(), D.end(), (int) 0, std::plus<int>());
  • Will the resulting device code be the same or might there be some subtle differences?
  • Does the answer depend on the compiler used for the thrust code?
  • Does the answer depend on the code being compiled (e.g. what about something more complicated than a simple reduce)?

For Thrust, this would be a library call, so I highly doubt you’d see any difference in the device code when using different host compilers.

For nvc++, our stdpar support for GPUs rides on top of Thrust, so there’s probably not much difference in the generated device code in this example. Memory management will be different, though: with stdpar we’re using CUDA Unified Memory, while Thrust uses device pointers directly.
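
To make the memory-management point concrete, here’s a rough sketch (my own illustration with made-up function names, not official documentation):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <vector>
#include <numeric>
#include <execution>

int sum_with_thrust(const thrust::host_vector<int>& h) {
  thrust::device_vector<int> d = h;  // explicit host-to-device copy; d owns device memory
  return thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());
}

int sum_with_stdpar(const std::vector<int>& v) {
  // No explicit copy: under nvc++ -stdpar=gpu, heap allocations come from
  // CUDA Unified Memory, so the vector's storage is accessible from the device.
  return std::reduce(std::execution::par, v.begin(), v.end(), 0, std::plus<int>());
}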

Does the answer depend on the code being compiled (e.g. what about something more complicated than a simple reduce)?

I don’t have experience here, nor have I looked specifically at code-generation differences between using Thrust directly and using it indirectly through nvc++ stdpar. Though I doubt there would be much difference in how the parallelism is applied. If there are differences, they would be in the device code generation for lambdas, which is under the control of the compiler as opposed to the Thrust library.

Thanks for the explanation. I didn’t know that stdpar was implemented on top of thrust.

Now I’m wondering, what are the main pros and cons of compiling a thrust program with nvc++ as opposed to nvcc? The killer feature of nvc++ is definitely stdpar, but when it comes to using the thrust library without stdpar, why would one choose nvc++ over nvcc or vice versa?

Probably doesn’t matter in this case. Both will work fine.

The main advantage of nvc++ is its support for multiple parallel models: stdpar, OpenACC, OpenMP, and CUDA (though CUDA support is still in progress). It’s also a native C++ compiler with better language support. It compiles in a single pass, creating the device code late in compilation. This allows for greater opportunities for optimization, auto-creation of device functions and lambdas, potentially faster compilation, and no need for a separate host compiler.
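
As a small illustration of the device-lambda point (just a sketch, with a made-up function name): with -stdpar=gpu an ordinary C++ lambda can be offloaded without any CUDA annotations.

#include <vector>
#include <algorithm>
#include <execution>

void scale(std::vector<double>& x, double a) {
  // Plain lambda; nvc++ -stdpar=gpu creates the device version automatically,
  // with no __device__ annotation and no --extended-lambda flag needed.
  std::for_each(std::execution::par_unseq, x.begin(), x.end(),
                [a](double& xi) { xi *= a; });
}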

Further details can be found in Bryce Lelbach’s GTC talk, starting around the 15 min mark: https://www.youtube.com/watch?v=KhZvrF_w1ak

Thanks for the link - that presentation answered a lot of my questions about the differences between nvcc and nvc++.

I have one final question on this topic. You said that both nvcc and nvc++ will work fine for compiling code that uses the Thrust library without stdpar. But using the Thrust library often requires specifying the __device__ annotation, which nvc++ doesn’t support yet. How is it that nvc++ can produce the same kind of output as nvcc for code that it doesn’t officially support? Or is it supported but just unofficially?

How is it that nvc++ can produce the same kind of output as nvcc for code that it doesn’t officially support? Or is it supported but just unofficially?

CUDA is not yet officially supported, but enough of CUDA is implemented in nvc++ to compile Thrust, including the __device__ attribute.
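
For example, a typical bit of Thrust code that relies on the __device__ attribute looks something like this (a minimal sketch with made-up names, not an exhaustive statement of what’s implemented):

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// User-defined functor with __host__ __device__ on its call operator, which is
// the kind of CUDA attribute Thrust code needs the compiler to understand.
struct saxpy_op {
  float a;
  __host__ __device__ float operator()(float x, float y) const { return a * x + y; }
};

void saxpy(float a, const thrust::device_vector<float>& x, thrust::device_vector<float>& y) {
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_op{a});
}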

Indeed it looks like I can successfully run the first Thrust example from https://thrust.github.io by saving it to main.cpp and compiling with:

nvc++ -x cu main.cpp
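
For reference, that example is roughly the following (copied from the Thrust front page at the time, so it may differ from the current version):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
  // generate 32M random numbers serially
  thrust::host_vector<int> h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}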

However, now I’m having a bit of trouble figuring out how to get this working with CMake.

If I create a placeholder CMakeLists.txt file like the following:

cmake_minimum_required(VERSION 3.20.0)
project(nvhpc-test LANGUAGES CXX)

And then run cmake -DCMAKE_CXX_FLAGS="-x cu" -DCMAKE_CXX_COMPILER=nvc++ ., then it spits out a bunch of errors - too long to post here.

Likewise if I change CMakeLists.txt to the following:

cmake_minimum_required(VERSION 3.20.0)
project(nvhpc-test LANGUAGES CUDA)

Then the CMake configure step spits out the following:

$ cmake -DCMAKE_CUDA_COMPILER=nvc++ ..
-- The CUDA compiler identification is unknown
-- Detecting CUDA compiler ABI info
CMake Error in /home/sweemer/nvhpc-test/build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  CUDA_ARCHITECTURES is empty for target "cmTC_9f548".


CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
Missing variable is:
_CMAKE_CUDA_WHOLE_FLAG
CMake Error in /home/sweemer/nvhpc-test/build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  CUDA_ARCHITECTURES is empty for target "cmTC_9f548".


CMake Error at /home/sweemer/share/cmake-3.22/Modules/CMakeDetermineCompilerABI.cmake:49 (try_compile):
  Failed to generate test project build system.
Call Stack (most recent call first):
  /home/sweemer/share/cmake-3.22/Modules/CMakeTestCUDACompiler.cmake:19 (CMAKE_DETERMINE_COMPILER_ABI)
  CMakeLists.txt:2 (project)


-- Configuring incomplete, errors occurred!

And lastly, if I specify both CXX and CUDA in CMakeLists.txt, then CMake picks up nvcc for the CUDA compiler, which is not what I want - I want it to use nvc++ for everything.

$ cmake -DCMAKE_CXX_COMPILER=nvc++ ..
-- The CXX compiler identification is NVHPC 21.9.0
-- The CUDA compiler identification is NVIDIA 11.4.100
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sweemer/nvhpc-test/build

Is there any way to get the Thrust example to compile using only nvc++ and CMake? Or is there a bug in CMake that prevents these approaches from working properly? Any guidance you can provide would be greatly appreciated.

My environment is as follows:

CMake version: 3.22.20211103-g6251239
HPC SDK version: 21.9
Driver Version: 460.91.03
Driver CUDA Version: 11.2
Device: Tesla T4
OS: Ubuntu 18.04.6 LTS

I’m not much of an expert with CMake, so I may not be too much help here. Though I wouldn’t call this a bug in CMake but rather a lack of support. Kitware, the maker of CMake, is probably assuming that since the compiler is from NVIDIA, you must be using nvcc. I’ll add an RFE to ask management to engage with Kitware on extending this support once nvc++ officially supports CUDA.

Per the CMAKE_CUDA_ARCHITECTURES documentation in CMake:

The CUDA_ARCHITECTURES option is set by querying the compiler as to which architectures it supports, but I have no idea how it’s doing this query. If it’s using the __CUDA_ARCH__ macro, then that’s not something we’ll be able to support.

Though it looks like you might be able to override this by setting the CUDAARCHS environment variable (see the CMake documentation).
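
For example, something like the following might get past that particular check (untested on my end; 75 would be the value for your T4):

$ CUDAARCHS=75 cmake -DCMAKE_CUDA_COMPILER=nvc++ ..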

Side note: the flag “-x cu” is accepted by nvc++ but is just a placeholder for compatibility, so it doesn’t do anything.

-Mat

Thanks for the explanation Mat. Not sure how many other people besides me will want to compile Thrust code with nvc++ using CMake, but it would be fantastic when it is eventually fully supported.

The reason why I insist on using nvc++ goes back to my other forum topic, where you told me that the C++ standard library is not yet supported in device code. I’d like to use Thrust for the <random> functionality until it is supported natively in nvc++, at which point I can just switch the random library I’m using instead of switching the compiler as well. Does that sound reasonable? Also, do you have an estimate for when <random> on device will be added to nvc++?
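
For concreteness, the kind of Thrust-based random number generation I have in mind looks roughly like this (just a sketch; the functor and function names are my own):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/random.h>

// Generate one uniform float per index; each element discards ahead by its
// index so the sequence is reproducible and independent per element.
struct uniform_gen {
  __host__ __device__ float operator()(unsigned int i) const {
    thrust::default_random_engine rng(1234);
    thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
    rng.discard(i);
    return dist(rng);
  }
};

void fill_random(thrust::device_vector<float>& v) {
  thrust::transform(thrust::counting_iterator<unsigned int>(0),
                    thrust::counting_iterator<unsigned int>(v.size()),
                    v.begin(), uniform_gen{});
}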

FYI, the reason I added -x cu to the compile command for the Thrust example is that I get the following error when it’s omitted:

$ nvc++ main.cpp
"main.cpp", line 1: catastrophic error: cannot open source file "thrust/host_vector.h"
  #include <thrust/host_vector.h>
                                 ^

1 catastrophic error detected in the compilation of "main.cpp".
Compilation terminated.

If I add the include path explicitly then nvc++ can find the header, but I then get this error:

$ nvc++ -I /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/include main.cpp
"/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/include/thrust/system/detail/generic/sort.inl", line 191: error: static assertion failed with "unimplemented for this system"
    THRUST_STATIC_ASSERT_MSG(
    ^
          detected during:
            instantiation of "void thrust::system::detail::generic::stable_sort(thrust::execution_policy<DerivedPolicy> &, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with DerivedPolicy=thrust::cuda_cub::tag, RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>, StrictWeakOrdering=thrust::less<int>]" at line 83 of "/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/include/thrust/detail/sort.inl"
            instantiation of "void thrust::stable_sort(const thrust::detail::execution_policy_base<DerivedPolicy> &, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with DerivedPolicy=thrust::cuda_cub::tag, RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>, StrictWeakOrdering=thrust::less<int>]" at line 63
            instantiation of "void thrust::system::detail::generic::sort(thrust::execution_policy<DerivedPolicy> &, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with DerivedPolicy=thrust::cuda_cub::tag, RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>, StrictWeakOrdering=thrust::less<int>]" at line 56 of "/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/include/thrust/detail/sort.inl"
            instantiation of "void thrust::sort(const thrust::detail::execution_policy_base<DerivedPolicy> &, RandomAccessIterator, RandomAccessIterator, StrictWeakOrdering) [with DerivedPolicy=thrust::cuda_cub::tag, RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>, StrictWeakOrdering=thrust::less<int>]" at line 49
            instantiation of "void thrust::system::detail::generic::sort(thrust::execution_policy<DerivedPolicy> &, RandomAccessIterator, RandomAccessIterator) [with DerivedPolicy=thrust::cuda_cub::tag, RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>]" at line 41 of "/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/include/thrust/detail/sort.inl"
            instantiation of "void thrust::sort(const thrust::detail::execution_policy_base<DerivedPolicy> &, RandomAccessIterator, RandomAccessIterator) [with DerivedPolicy=thrust::cuda_cub::tag, RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>]" at line 215 of "/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/cuda/11.4/include/thrust/detail/sort.inl"
            instantiation of "void thrust::sort(RandomAccessIterator, RandomAccessIterator) [with RandomAccessIterator=thrust::detail::normal_iterator<thrust::device_ptr<int>>]" at line 19 of "main.cpp"

1 error detected in the compilation of "main.cpp".

Are you sure that -x cu is not doing anything? As far as I can tell it is the only way to add all the necessary dependencies to compile the Thrust example with nvc++.

Now I’m not so sure. I’ve not actually used it before, and the help message states that it’s just a placeholder. I’ll need to ask engineering whether that’s just a problem with the help message and it’s now being used to select the language of the file regardless of the file suffix.

% nvc++ --help -x cu
+x Ignored, for compatibility only

To enable CUDA support, I use the “-cuda” flag. Does using “-cuda” get rid of the error message?

Does that sound reasonable?

Makes sense.

Also, do you have an estimate for when <random> on device will be added to nvc++?

Sorry, no idea. Though even if I did, I wouldn’t be able to give a timeline on a public forum.

What has been mentioned publicly is in another GTC talk from Bryce. Although it’s not specific about which features will be supported or when it will be available, we are working on a std implementation for the device called libnv++. See https://www.youtube.com/watch?v=f25--stqv6U starting around the 6 min mark.

Compiling with nvc++ -cuda main.cpp also works, so it looks like -x cu does the same thing as -cuda. I agree that it would be good to update the help message to reflect that.
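
In the meantime, here’s the CMakeLists.txt I plan to try as a workaround: keep the project as plain CXX so CMake never tries to identify a CUDA compiler, and pass -cuda to nvc++ for compiling and linking. This is untested, so it may well run into the same issues as above.

cmake_minimum_required(VERSION 3.20.0)
project(nvhpc-test LANGUAGES CXX)

add_executable(thrust-example main.cpp)
# Pass -cuda to nvc++ at compile and link time instead of enabling CMake's
# CUDA language support (which would pick up nvcc).
target_compile_options(thrust-example PRIVATE -cuda)
target_link_options(thrust-example PRIVATE -cuda)

Configuring would then just be cmake -DCMAKE_CXX_COMPILER=nvc++ .. as before.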

Looking forward to seeing more announcements about libnv++ in the near future. Thanks again for all the help!

CUDA is not yet officially supported, but enough of CUDA is implemented in nvc++ to compile Thrust, including the __device__ attribute.

Just wanted to follow up on this and ask whether compiling CUDA with nvc++ is officially supported yet.

The latest documentation for version 22.5 says the following:

The NVFORTRAN compiler supports CUDA programming in Fortran. See the NVIDIA CUDA Fortran Programming Guide for complete details on how to use CUDA Fortran. The NVCC compiler supports CUDA programming in C and C++ in combination with a host C++ compiler on your system. See the CUDA C++ Programming Guide for an introduction and overview of how to use NVCC and CUDA C++.

So apparently nvfortran and nvcc officially support CUDA, but the documentation doesn’t yet say specifically whether nvc++ does. If it doesn’t yet, then do you have an estimate for when it will? If it does, then I recommend mentioning this in the documentation explicitly.

nvc++ is able to compile much of CUDA (basically enough to compile Thrust), but not all of it, which is why it’s not considered supported as of yet. It’s an ongoing project and support is being added incrementally. There’s lots of testing and bug fixing to do, but no firm timeline for when it can be blessed as “officially” supported.