std::transform_reduce incompatible with nvc++ -stdpar=gpu

I am trying to write GPU-accelerated programs using only Standard C++ algorithms. I would like to implement a “dot product” between two vectors using std::transform_reduce. The code I’ve written is the following:


#include <execution>
#include <iostream>
#include <numeric>

int dim = 10000;

double dot_product(std::vector<double> v1, std::vector<double> v2)
{
    return std::transform_reduce(std::execution::par_unseq, v1.begin(), v1.end(), v2.begin(), 0.0);
}

int main()
{
    std::vector<double> v1(dim, 1.0), v2(dim, 1.0);
    std::cout << dot_product(v1, v2) << std::endl;
    return 0;
}

This program runs correctly when compiled with

g++ -std=c++20 test.cpp -o test -ltbb

and with

nvc++ -std=c++20  -stdpar=multicore  test.cpp -o test

but it fails when compiled with

nvc++ -std=c++20  -stdpar=gpu  test.cpp -o test

When I run the program compiled with the -stdpar=gpu flag, it prints

Failing in Thread:0
call to cuInit returned error 804: Other

How can I solve this problem?
(Note: I’ve also tried using std::execution::par without success)

I had to add
#include <vector>
to get it to compile, but then it worked for both multicore and gpu.
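For reference, here is a minimal sketch of the dot product with that header added. Two things in it are my own choices and not part of the original post: the vectors are passed by const reference to avoid copying them on every call, and the reduce and transform operations are spelled out explicitly (std::plus and std::multiplies are what this overload uses by default anyway). It assumes v2 has at least as many elements as v1.

```cpp
#include <execution>   // std::execution::par_unseq
#include <functional>  // std::plus, std::multiplies
#include <numeric>     // std::transform_reduce
#include <vector>      // std::vector (the header missing from the original post)

// Dot product of v1 and v2.
// Assumption: v2.size() >= v1.size(); only the first v1.size() elements of v2 are read.
double dot_product(const std::vector<double>& v1, const std::vector<double>& v2)
{
    return std::transform_reduce(std::execution::par_unseq,
                                 v1.begin(), v1.end(), v2.begin(),
                                 0.0,                   // initial value of the sum
                                 std::plus<>{},         // reduce: add the products
                                 std::multiplies<>{});  // transform: multiply element pairs
}
```

Called from the original main() with two vectors of 10000 ones, this should print 10000.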

$ nvc++ -std=c++20 -stdpar=gpu test2.cpp -o test_gpu
$ nvc++ -std=c++20 -stdpar=multicore test2.cpp -o test_cpu

$ ./test_cpu

$ ./test_gpu

The cuInit error is more of a setup issue on your system than a problem with the code.