Severe performance regression in nvc++ starting from 24.11

I have a code base which uses an "expression templates" approach to allow writing mathematical operations on thrust device_vectors with lazy evaluation. The idea is that one can write possibly long mathematical expressions with simple syntax using operator overloading, and the kernel is only evaluated when the expression is fully specified and actually needed. The functionality is built on thrust transform_iterators and zip_iterators.

The performance of the code has been good even with rather complex expressions. However, when I tried to update our system to a newer HPC SDK, I noticed severe regressions, with the code becoming ~10-100x slower. I managed to pinpoint the slowdown to version 24.11. 24.9 still works ok, but all the versions after that do not. The most recent version, 25.7, is a bit better, but still unusably slow.

Now I don’t know whether this is related to changes in thrust/CCCL or nvc++, but because both 24.9 and 24.11 appear to use the same version of CCCL, I would guess it is related to the compiler. Also, nvcc doesn’t seem to show the same issues (at least in a simple test case), only nvc++.

I managed to create a small example, which is given below. This is rather artificial, but it mimics the behaviour of the code that would be produced through operator overloading and expression templates. In pseudo-code an expression like “(vec_a + vec_b) + vec_c” would become

transform
(
    zip
    (
        transform
        (
            zip(vec_a, vec_b), Sum
        ),
        vec_c
    ),
    Sum
)

and so on.

The example:

#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <chrono>
#include <iostream>

// Adaptor for binary expressions
template <typename BinaryOp>
struct ApplyBinaryOp
{
    BinaryOp op;

    inline __host__ __device__ ApplyBinaryOp(BinaryOp f)
        : op(f) {}

    template <typename Tuple>
    inline __host__ __device__ auto operator()(const Tuple& t) const
    {
        return op(thrust::get<0>(t), thrust::get<1>(t));
    }
};

// Example operator
struct Sum
{
    __host__ __device__
    double operator()(double x, double y) const
    {
        return x+y;
    }
};

int main()
{
    // Generate some dummy vectors vec_a vec_b
    const size_t N = 100000000;
    thrust::counting_iterator<double> iter(1);
    thrust::device_vector<double> vec_a(N);
    thrust::copy(iter, iter + vec_a.size(), vec_a.begin());
    thrust::device_vector<double> vec_b = vec_a;

    // Create a lazy expression for
    // "(vec_a + vec_b) + (vec_a + vec_b) + vec_b"

    // Create vec_a + vec_b
    auto z1 = thrust::make_zip_iterator(vec_a.begin(), vec_b.begin());
    auto t1 = thrust::make_transform_iterator(z1, ApplyBinaryOp<Sum>(Sum{}));

    // Add another (vec_a + vec_b) by referencing to t1
    auto z2 = thrust::make_zip_iterator(t1, t1);
    auto t2 = thrust::make_transform_iterator(z2, ApplyBinaryOp<Sum>(Sum{}));

    // Finally add vec_b
    auto z3 = thrust::make_zip_iterator(t2, vec_b.begin());
    auto t3 = thrust::make_transform_iterator(z3, ApplyBinaryOp<Sum>(Sum{}));

    // Force evaluation, eg. using thrust::reduce here
    auto start = std::chrono::high_resolution_clock::now();
    double x = thrust::reduce(t3, t3+N, double(0), thrust::plus<double>());
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

    std::cout << "Result " << x << " took " <<  duration.count() << " us." << std::endl;
}

If I compile this with "nvc++ -std=c++17 -stdpar -fast" I get the following runtimes:
24.9: ~3000 us
24.11: ~33300 us
25.7: ~23600 us

i.e. roughly an 8-11x increase in runtime. While figuring this out, I tried many different O-levels and other flags, but nothing seemed to help.

If you profile with Nsight Systems (nsys), you will see that the issue correlates with a very different number of registers per thread.
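For reference, one way to capture this (a sketch; output file names are placeholders and the per-kernel register count is easiest to see in the GUI kernel details pane):

```
nvc++ -std=c++17 -stdpar -fast lazy.cpp -o lazy
nsys profile -o lazy_report ./lazy
# open lazy_report.nsys-rep in the Nsight Systems GUI and inspect the kernel launch
```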

Slow case:
Begins: 0.379154s
Ends: 0.399616s (+20.462 ms)
grid: <<<400, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 80 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 208
Local Memory Per Thread: 0 bytes
Local Memory Total: 152,043,520 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 12.5 %

Fast case:
Begins: 0.356016s
Ends: 0.358621s (+2.605 ms)
grid: <<<2400, 1, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 80 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 40
Local Memory Per Thread: 0 bytes
Local Memory Total: 94,371,840 bytes
Shared Memory executed: 8,192 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 75 %

Hi tniemi and welcome,

This is an odd one in that nvc++ doesn’t do much here and simply uses the Thrust headers. Though you’re correct that the CUDA and Thrust versions are the same between NVHPC 24.9 and 24.11, so something else changed.

The performance delta looks to be coming from the reduce, where the schedule changed, though I’ll need to ask engineering to look into why. I filed a problem report, TPR#37844, and have asked them to take a look.

-Mat

Hi,

Thanks for looking into this.

Yes, registers are probably the issue here, or at least they hint at what the problem is. Somehow the older compilers are able to produce optimized code while the newer ones fail. What is surprising is that in our actual code the kernels can contain several tens of operations, including nasty branching max/min ops, and it still works nicely with the older compilers. With the new compilers, just a couple of plus ops is enough to tank the performance. Also, the issue is not tied to a specific algorithm such as reduce; even a simple call such as copy is enough to show it. In fact, e.g.

thrust::copy(t3, t3 + N, vec_a.begin());

shows an even worse regression: ~80 ms vs ~4 ms.
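Concretely, this is a drop-in replacement for the reduce in the example above (same timing scaffolding around it):

```
// Replace the thrust::reduce call in the example with:
auto start = std::chrono::high_resolution_clock::now();
thrust::copy(t3, t3 + N, vec_a.begin());
auto stop = std::chrono::high_resolution_clock::now();
```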

Although nvc++ is only consuming the Thrust headers, the compiler does have quite a job when it evaluates the final iterator. Dereferencing it expands to a long chain of function calls, which the compiler has to collapse into the underlying primitive math ops and then optimize as a whole. So it is quite an impressive feat when it works.