Consider a class representing large integers, for which I have implemented the operator+ function:
class Integer {
public:
    __device__ Integer operator+(const Integer& other) const {
        // Some code
    }
    int value[4];
};
Now I need to test the performance of this function. To do this, I have written the following test kernel:
__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    for (uint32_t i = 0; i < 100000; i++) {
        a + b;
    }
    // Do not write the result of a + b to global memory this time.
}
Clearly, an optimizing compiler will remove the side-effect-free expression a + b; from the code above, so this kernel cannot produce meaningful performance data. One feasible solution is to store the result of a + b and write it to global memory in every iteration, as follows:
__global__ void kernel(Integer *out, const Integer *in_a, const Integer *in_b) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    for (uint32_t i = 0; i < 100000; i++) {
        Integer t = a + b;
        // Write the value of t to a location in the array out.
    }
}
However, this approach has two issues:
- Because a + b yields the same result in every iteration, the compiler often hoists the computation out of the for loop. In practice, the a + b operation is then executed only once.
- The for loop performs a large number of writes to global memory, which can significantly distort the measured time.
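One way to sidestep both issues can be sketched as follows: feed each result back into the next addition, so that the computation is no longer loop-invariant, and write to global memory only once after the loop. This is a minimal sketch under stated assumptions, not a definitive benchmark. The carry-propagating body of operator+, the `iters` parameter, and the names `chained_kernel` and `run_chained` are hypothetical and do not come from the original code.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real class: 128-bit addition with carry
// propagation across four 32-bit limbs (least significant limb first).
struct Integer {
    __host__ __device__ Integer operator+(const Integer& other) const {
        Integer r;
        uint32_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t s = (uint64_t)(uint32_t)value[i] + (uint32_t)other.value[i] + carry;
            r.value[i] = (int)(uint32_t)s;
            carry = (uint32_t)(s >> 32);
        }
        return r;
    }
    int value[4];
};

// Each iteration consumes the previous result, so the additions form a
// dependent chain that cannot simply be hoisted out of the loop.
__global__ void chained_kernel(Integer* out, const Integer* in_a,
                               const Integer* in_b, uint32_t iters) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    for (uint32_t i = 0; i < iters; i++) {
        a = a + b;
    }
    out[tid] = a;  // one write per thread keeps the whole chain observable
}

// Hypothetical host helper: launches one block of 32 threads with a = 1 and
// b = 2 in every thread, then checks that each result equals 1 + 2 * iters.
bool run_chained(uint32_t iters) {
    const int n = 32;
    Integer h_a[n], h_b[n], h_out[n];
    for (int i = 0; i < n; i++) {
        h_a[i] = Integer{{1, 0, 0, 0}};
        h_b[i] = Integer{{2, 0, 0, 0}};
    }
    Integer *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_a, sizeof(h_a)) == cudaSuccess &&
              cudaMalloc(&d_b, sizeof(h_b)) == cudaSuccess &&
              cudaMalloc(&d_out, sizeof(h_out)) == cudaSuccess &&
              cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        chained_kernel<<<1, n>>>(d_out, d_a, d_b, iters);
        // cudaMemcpy on the default stream waits for the kernel to finish.
        ok = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    for (int i = 0; ok && i < n; i++) {
        ok = h_out[i].value[0] == (int)(1 + 2 * iters) && h_out[i].value[1] == 0;
    }
    return ok;
}
```

Calling `run_chained(1000)` from a host program exercises the kernel. Note the trade-off: this measures a chain of dependent additions rather than repeated independent evaluations of the same a + b, so the timing reflects latency more than throughput.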
To address these issues, I tried an approach similar to the DoNotOptimizeAway method in the nanobench library. The core idea is to make the compiler believe that the result of a + b is used as input to a subsequent operation; because that operation has a side effect, the addition cannot be optimized away.
After some experimentation, I managed to preserve the a + b; operation in the generated PTX code without introducing additional instructions. However, PTX is not the final machine code: ptxas still has to compile and optimize it into the machine instructions that run on the hardware. After the ptxas optimization, the a + b; operation is still optimized away.
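The DoNotOptimizeAway idea can be illustrated with an empty inline-asm statement that lists the limbs as read-write operands. This is a minimal sketch of the general technique, not necessarily the exact code used in the experiment above. The helper `do_not_optimize_away`, the kernel `barrier_kernel`, the host function `run_barrier`, the `iters` parameter, and the body of operator+ are all hypothetical names and assumptions introduced for illustration.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real class: 128-bit addition with carry
// propagation across four 32-bit limbs (least significant limb first).
struct Integer {
    __host__ __device__ Integer operator+(const Integer& other) const {
        Integer r;
        uint32_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t s = (uint64_t)(uint32_t)value[i] + (uint32_t)other.value[i] + carry;
            r.value[i] = (int)(uint32_t)s;
            carry = (uint32_t)(s >> 32);
        }
        return r;
    }
    int value[4];
};

// Empty inline-asm statement that names all four limbs as read-write ("+r")
// operands. It emits no PTX instructions, but the front-end compiler must
// assume the values are consumed and possibly modified at this point.
__device__ __forceinline__ void do_not_optimize_away(Integer& v) {
    asm volatile("" : "+r"(v.value[0]), "+r"(v.value[1]),
                      "+r"(v.value[2]), "+r"(v.value[3]) : : "memory");
}

__global__ void barrier_kernel(Integer* out, const Integer* in_a,
                               const Integer* in_b, uint32_t iters) {
    uint32_t tid = blockDim.x * blockIdx.x + threadIdx.x;
    Integer a = in_a[tid], b = in_b[tid];
    Integer t = a;
    for (uint32_t i = 0; i < iters; i++) {
        do_not_optimize_away(a);  // a "may have changed": a + b is not loop-invariant
        t = a + b;
        do_not_optimize_away(t);  // t "is used": the addition is not dead code
    }
    out[tid] = t;  // single write so that the host can check the result
}

// Hypothetical host helper: a = 1 and b = 2 in every thread, so t must be 3.
bool run_barrier(uint32_t iters) {
    const int n = 32;
    Integer h_a[n], h_b[n], h_out[n];
    for (int i = 0; i < n; i++) {
        h_a[i] = Integer{{1, 0, 0, 0}};
        h_b[i] = Integer{{2, 0, 0, 0}};
    }
    Integer *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_a, sizeof(h_a)) == cudaSuccess &&
              cudaMalloc(&d_b, sizeof(h_b)) == cudaSuccess &&
              cudaMalloc(&d_out, sizeof(h_out)) == cudaSuccess &&
              cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        barrier_kernel<<<1, n>>>(d_out, d_a, d_b, iters);
        ok = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    for (int i = 0; ok && i < n; i++) {
        ok = h_out[i].value[0] == 3 && h_out[i].value[1] == 0;
    }
    return ok;
}
```

Because the asm body is empty, nothing in the emitted PTX marks the values as used. That may be why ptxas remains free to delete the repeated additions, so this sketch only guarantees a correct result, not that every addition survives into the machine code.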
How can I ensure that operations like a + b; are retained in the final executable machine code even when optimization is enabled, so that I can measure their performance?