How can I prevent side-effect-free code from being optimized away by the compiler when performance testing?

Consider a class representing large integers, for which I have implemented the

class Integer {
public:
    Integer operator+(const Integer& other) const {
        // Some code
    }
    int value[4];
};

function. Now, I need to test the performance of this function. To do this, I have written the following test kernel function:

__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    for (uint32_t i = 0; i < 100000; i++) {
        a + b;
    }
    // Do not write result of a + b to global memory this time.
}

Clearly, when the compiler optimizes the above code, it will eliminate the side-effect-free expression a + b;, making it impossible to correctly test and obtain the required performance data. One feasible solution is to save the result of a + b and write it to global memory on each iteration, as follows:

__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    for (uint32_t i = 0; i < 100000; i++) {
        Integer t = a + b;
        // Write the value of t to a location in the array out.
    }
}

However, this approach has two issues of its own.

  1. Because the result of a + b is the same in every iteration, the compiler often hoists the computation of a + b out of the for loop; in practice, the a + b operation is then executed only once (see the sketch after this list).
  2. There are a large number of write operations to memory within the for loop, which can significantly impact performance.
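
In effect, issue 1 means the compiler rewrites the loop body above into something like the following sketch (illustrative only):

// After loop-invariant code motion (illustrative):
Integer t = a + b;   // hoisted: computed once, before the loop
for (uint32_t i = 0; i < 100000; i++) {
    // Write the value of t to a location in the array out.
}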

To address these issues, I tried an approach similar to the DoNotOptimizeAway method in the nanobench library. The core idea is to make the compiler believe that the result of a + b will be used as input to a subsequent operation; since that subsequent operation has a side effect, the computation cannot be optimized away.
After some experimentation, I successfully preserved the a + b; operation in the generated PTX code without introducing additional instructions. However, PTX is not the final machine code; it still undergoes compilation and optimization by ptxas to become the machine instructions that actually execute on the hardware, and after the ptxas optimization the a + b; operation is still eliminated.
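
For illustration, the barrier I experimented with was roughly of the following shape (a sketch only; the empty inline asm claims to read and modify each word of the result):

// Sketch of a DoNotOptimizeAway-style barrier (illustrative).
// The empty asm pretends to read and modify every word of x, so the
// front end must keep the computation that produced it.
__device__ __forceinline__ void DoNotOptimizeAway(Integer& x) {
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        asm volatile("" : "+r"(x.value[i]) : : "memory");
    }
}

// Usage inside the test loop:
// Integer t = a + b;
// DoNotOptimizeAway(t);

This keeps the computation alive in the PTX, but as described above, ptxas still eliminates the a + b; operation, presumably because the empty asm body leaves nothing in the PTX for ptxas to respect.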
How can I ensure that operations like a + b; are retained in the final executable machine code sequence even when optimization is enabled, in order to test their performance?

You could have only thread 0 write to the output.

#include <cstdint>

class Integer {
public:
    __host__ __device__
    Integer(){
        for(int i = 0; i < 4; i++){
            value[i] = 0;
        }
    }

    __host__ __device__
    Integer& operator+=(const Integer& rhs) {
        for(int i = 0; i < 4; i++){
            value[i] += rhs.value[i];
        }
        return *this; 
    }
 
    __host__ __device__
    friend Integer operator+(Integer lhs, const Integer& rhs)
    {
        lhs += rhs;
        return lhs;
    }

    int value[4];
};

__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    Integer sum;
    for (uint32_t i = 0; i < 100000; i++) {
        sum += a + b;
    }
    if(tid == 0){
        out[0] = sum;
    }
}

Hi @striker159,
I think your solution may not address the problem I’m facing. Firstly, even though the code that decides whether to write the computation result to global memory is placed at the end of the function, the compiler is very likely to transform the code so that only the thread with tid equal to 0 executes the for loop, as shown in the following code:

__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid == 0){
        Integer a = in_a[tid], b = in_b[tid];
        Integer sum;
        for (uint32_t i = 0; i < 100000; i++) {
            sum += a + b;
        }
        out[0] = sum;
    }
}

Secondly, it introduces a new call to the operator+= function within the for loop, while what I need to test is only the operator+ function. This additional call will inevitably affect the accuracy of the measurement.

You can easily check whether the code is executed only on thread 0. From what I can tell from the SASS code here on Compiler Explorer, all threads execute the for loop.

The loop will effectively execute operator+ / operator+= twice, since one is implemented in terms of the other. From your description it is not clear to me why this would be a problem; it should be the same as doing 200000 iterations instead of 100000 in your original code.

Another possible approach is to do a data-dependent store:

if (sum.value[0] == -1) out[0] = sum;

If you can arrange the data involved so that the if test is never satisfied, then the store operation will not take place. Doing the if-test does “cost” something, of course.
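
In context, this might look like the following sketch (assuming the inputs are arranged so that sum.value[0] can never be -1):

__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    Integer sum;
    for (uint32_t i = 0; i < 100000; i++) {
        sum += a + b;
    }
    // The compiler cannot prove this condition is always false, so it
    // must keep the loop; with suitably arranged inputs the store
    // itself never executes at run time.
    if (sum.value[0] == -1) out[0] = sum;
}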

I looked at the SASS code, and as you mentioned, each thread indeed executes the for loop. This makes me curious why the compiler did not transform the code to first check whether tid is 0 and execute the for loop only when it is.
In this example, it is true that executing one += operation and one + operation within the for loop is equivalent, in terms of computational workload, to executing two + operations. However, I want to discuss a more general issue: my goal is to measure the performance of any piece of code without introducing extraneous factors. What I want to measure could be a binary operation with complex computational logic, and the += and + operations may require vastly different implementations. Even if the algorithms are similar, the compiled results may differ significantly. Therefore, it is advisable to avoid introducing additional code within the loop body.

This approach should be more reliable than checking whether tid equals 0. After all, the compiler cannot make execution of the for loop conditional on the if test, since the subsequent data write depends on the loop's results. However, this method seems hard to turn into a universal solution, as it would require crafting specific inputs for each piece of computational logic to guarantee that the results can never take a particular value.

The solution I am most interested in is the ability to create a generic testing function using C++ templates, such as the following:

template<typename InputA, typename InputB, typename BinaryMathOp>
__global__ void test_performance(const InputA *pa, const InputB *pb) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    InputA a = pa[tid];
    InputB b = pb[tid];
    for (uint32_t i = 0; i < 1000000; i++) {
        BinaryMathOp::Cal(a, b);
    }
}

Using this kernel, I only need to pass a BinaryMathOp class similar to the following to test the performance of any binary operation:

class Add {
public:
    static __device__ void Cal(const Integer &a, const Integer &b) {
        // a + b
        // or a - b
        // or a * b
        // or some other complicated binary operation
    }
};
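
A launch would then look something like this (grid_size, block_size, and the device pointers d_a and d_b are placeholders):

// d_a and d_b are device arrays of Integer (hypothetical names).
test_performance<Integer, Integer, Add><<<grid_size, block_size>>>(d_a, d_b);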

A canonical way to make this kind of test loop work is to sum all the results and store the sum at the end. That way, all the operations under test contribute to an externally visible state change, and the test loop isn’t eliminated by the compiler as dead code. Obviously this requires an initial calibration pass, so the cost of the summing can be subtracted out from the subsequent measurement passes.
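
As a sketch of what this might look like (names and iteration counts are illustrative; one would still verify in the SASS that both loops survive):

// Calibration pass: measures the cost of the summing alone.
__global__ void calibrate(Integer *out, const Integer *in_a) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid];
    Integer sum;
    for (uint32_t i = 0; i < 100000; i++) {
        sum += a;
    }
    out[tid] = sum;   // externally visible, so the loop is not dead code
}

// Measurement pass: operation under test plus the same summing.
__global__ void measure(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    Integer sum;
    for (uint32_t i = 0; i < 100000; i++) {
        a = a + b;    // result feeds the next iteration, so the
        sum += a;     // addition cannot be hoisted out of the loop
    }
    out[tid] = sum;
}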

Do you have any plans to add a function similar to the DoNotOptimizeAway function in the nanobench library to address such issues?

This method doesn’t seem ideal either. Performance is determined by more than just computational workload. Introducing sum operations may lead to differences in the compiled code, such as register usage, which could impact actual performance. Ideally, I would want to avoid introducing uncontrollable factors that affect performance.

While I haven’t encountered the same issue on the CUDA side so far, on the OpenCL side I’ve observed that executing result = add(a, b) (the + operation) within a for loop consumes significantly more registers than executing add_assign(a, b) (the += operation). I suspect this is because, when compiling add(const Integer *a, const Integer *b), the compiler must treat a and b as constant within the function, preventing the reuse of their registers during the computation. In contrast, add_assign(Integer *a, const Integer *b) can reuse the registers holding a, resulting in significant register savings, improved parallelism, and thus better overall performance. This phenomenon shows how subtle differences in source code can lead to significant variations in compiled output. While I haven’t observed a similar issue in CUDA (the CUDA compiler appears to be much more efficient than the OpenCL compiler on AMD GPUs), it’s best to avoid making too many assumptions about the compiler when writing generic testing code.
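
For reference, the two signatures being contrasted (illustrative declarations; the register behavior described above is my hypothesis, not a guarantee):

// Out-of-place form: a and b must remain intact for the whole call,
// so (per the hypothesis above) their registers cannot be recycled
// while the result is being computed.
Integer add(const Integer *a, const Integer *b);

// In-place form: the words of *a may be overwritten as the
// computation proceeds, allowing register reuse.
void add_assign(Integer *a, const Integer *b);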

Additionally, the result of addition between values of type T is not necessarily of type T. For example, in point addition on an elliptic curve, AffinePoint + AffinePoint typically yields a ProjectivePoint. If addition between ProjectivePoint instances is not defined, then it would not be possible to keep the addition operations from being optimized away by summing the results.

All valid points. However, this has worked well enough for me in the past, including with CUDA. As an engineer, I have always considered it my task to achieve useful results toward a particular end goal with whatever is available now. Engineers usually do not have the time to wait for ideal conditions.

I could likewise point out that what you appear to want to measure yields information that is not particularly useful for assessing any actual execution scenario, precisely because execution is complex on all modern high-performance processors (CPU and GPU).

I’m not aware of any such plans.