I have a code base which uses an “expression templates” approach to allow writing mathematical operations on thrust device_vectors with lazy evaluation. The idea is that one can write possibly long mathematical expressions with simple syntax using operator overloading, and a kernel is only evaluated once the expression is fully specified and its result is actually needed. The functionality is based on thrust transform_iterators and zip_iterators.
The performance of the code has been good even with rather complex expressions. However, when I tried to update our system to a newer HPC SDK, I noticed severe performance regressions, with the code becoming ~10-100x slower. I managed to pinpoint the slowdown to version 24.11: 24.9 still works fine, but all versions after that do not. The most recent version, 25.7, is a bit better, but still unusably slow.
Now I don’t know whether this is related to changes in thrust/CCCL or in nvc++, but because both 24.9 and 24.11 appear to use the same version of CCCL, I would guess it is related to the compiler. Also, nvcc doesn’t seem to show the same issues (at least in a simple test case), only nvc++.
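For reference, the nvcc comparison was a straightforward build of the same source, roughly like this (the exact invocation is from memory and may have differed slightly):

nvcc -O3 -std=c++17 test.cu -o test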
I managed to create a small example, which is given below. It is rather artificial, but it mimics the code that would be produced through operator overloading and expression templates. In pseudo-code, an expression like “(vec_a + vec_b) + vec_c” would become
transform
(
    zip
    (
        transform
        (
            zip(vec_a, vec_b), Sum
        ),
        vec_c
    ),
    Sum
)
and so on.
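For completeness, the kind of operator overloading that generates this nesting would look roughly like the following sketch. Expr, make_expr, and the operator+ here are illustrative stand-ins, not the actual codebase, and the sketch reuses ApplyBinaryOp and Sum from the full example below.

// A thin wrapper so that operator+ below applies only to expression iterators
template <typename Iterator>
struct Expr
{
    Iterator it;
};

template <typename Iterator>
Expr<Iterator> make_expr(Iterator it) { return Expr<Iterator>{it}; }

// Lazily builds transform(zip(lhs, rhs), Sum); no kernel runs here
template <typename L, typename R>
auto operator+(Expr<L> lhs, Expr<R> rhs)
{
    auto z = thrust::make_zip_iterator(lhs.it, rhs.it);
    return make_expr(thrust::make_transform_iterator(z, ApplyBinaryOp<Sum>(Sum{})));
}

// Usage:
//   auto e = (make_expr(vec_a.begin()) + make_expr(vec_b.begin()))
//            + make_expr(vec_c.begin());
// e.it is then exactly the kind of nested transform/zip iterator shown above.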
The example:
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/tuple.h>
#include <chrono>
#include <iostream>
// Adaptor for binary expressions
template <typename BinaryOp>
struct ApplyBinaryOp
{
    BinaryOp op;

    inline __host__ __device__ ApplyBinaryOp(BinaryOp f)
        : op(f) {}

    template <typename Tuple>
    inline __host__ __device__ auto operator()(const Tuple& t) const
    {
        return op(thrust::get<0>(t), thrust::get<1>(t));
    }
};

// Example operator
struct Sum
{
    __host__ __device__
    double operator()(double x, double y) const
    {
        return x + y;
    }
};

int main()
{
    // Generate some dummy vectors vec_a and vec_b
    const size_t N = 100000000;
    thrust::counting_iterator<double> iter(1);
    thrust::device_vector<double> vec_a(N);
    thrust::copy(iter, iter + vec_a.size(), vec_a.begin());
    thrust::device_vector<double> vec_b = vec_a;

    // Create a lazy expression for
    // "(vec_a + vec_b) + (vec_a + vec_b) + vec_b"

    // Create vec_a + vec_b
    auto z1 = thrust::make_zip_iterator(vec_a.begin(), vec_b.begin());
    auto t1 = thrust::make_transform_iterator(z1, ApplyBinaryOp<Sum>(Sum{}));

    // Add another (vec_a + vec_b) by referencing t1
    auto z2 = thrust::make_zip_iterator(t1, t1);
    auto t2 = thrust::make_transform_iterator(z2, ApplyBinaryOp<Sum>(Sum{}));

    // Finally add vec_b
    auto z3 = thrust::make_zip_iterator(t2, vec_b.begin());
    auto t3 = thrust::make_transform_iterator(z3, ApplyBinaryOp<Sum>(Sum{}));

    // Force evaluation, e.g. using thrust::reduce here
    auto start = std::chrono::high_resolution_clock::now();
    double x = thrust::reduce(t3, t3 + N, double(0), thrust::plus<double>());
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Result " << x << " took " << duration.count() << " us." << std::endl;
}
If I compile this with "nvc++ -std=c++17 -stdpar -fast", I get the following runtimes:
24.9: ~3000 us
24.11: ~33300 us
25.7: ~23600 us
i.e. a 7-10x increase in runtime. While figuring this out, I tried many different optimization levels and other flags, but nothing seemed to help.
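For comparison, the eager equivalent of the same expression, with each intermediate materialized into a temporary device_vector, would look roughly like the sketch below (it plugs into the example above and additionally needs #include <thrust/transform.h>; I attach no timings for this variant):

// Eager version of "(vec_a + vec_b) + (vec_a + vec_b) + vec_b":
// each step writes a temporary instead of fusing into one kernel
thrust::device_vector<double> tmp1(N), tmp2(N), tmp3(N);
thrust::transform(vec_a.begin(), vec_a.end(), vec_b.begin(), tmp1.begin(), Sum{}); // tmp1 = a + b
thrust::transform(tmp1.begin(), tmp1.end(), tmp1.begin(), tmp2.begin(), Sum{});    // tmp2 = tmp1 + tmp1
thrust::transform(tmp2.begin(), tmp2.end(), vec_b.begin(), tmp3.begin(), Sum{});   // tmp3 = tmp2 + b
double x_eager = thrust::reduce(tmp3.begin(), tmp3.end(), double(0), thrust::plus<double>());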