I am trying to write GPU-accelerated programs using only Standard C++ algorithms. I would like to implement a dot product between two vectors using std::transform_reduce. The code I’ve written is the following:
//test.cpp
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int dim = 10000;

// Dot product: multiplies the vectors element-wise and sums the products.
double dot_product(const std::vector<double>& v1, const std::vector<double>& v2)
{
    return std::transform_reduce(std::execution::par_unseq, v1.begin(), v1.end(), v2.begin(), 0.0);
}

int main()
{
    std::vector<double> v1(dim, 1.0), v2(dim, 1.0);
    std::cout << dot_product(v1, v2) << std::endl;
    return 0;
}
This program runs correctly when compiled with
g++ -std=c++20 test.cpp -o test -ltbb
and with
nvc++ -std=c++20 -stdpar=multicore test.cpp -o test
but it fails at run time when compiled with
nvc++ -std=c++20 -stdpar=gpu test.cpp -o test
When I run the program compiled with the -stdpar=gpu flag, it prints
Failing in Thread:0
call to cuInit returned error 804: Other
How can I solve this problem?
(Note: I’ve also tried using std::execution::par, without success.)
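For reference, this is the result I expect from every build. Below is a minimal sequential sketch of the same dot product using std::inner_product, with no execution policy involved. The name dot_product_seq is only for illustration and is not part of my test program, and it assumes both vectors have the same length.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential reference (hypothetical helper, for comparison only):
// returns the sum of v1[i] * v2[i] over all i, starting from 0.0.
// Assumes v2 has at least as many elements as v1.
double dot_product_seq(const std::vector<double>& v1, const std::vector<double>& v2)
{
    return std::inner_product(v1.begin(), v1.end(), v2.begin(), 0.0);
}
```

For two vectors of 10000 elements all equal to 1.0, this returns 10000, which is also what the g++ and -stdpar=multicore builds print.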