I’m trying to write microbenchmarks that measure the latency of individual arithmetic instructions, but the optimizer keeps rearranging my code. I’m using the volatile keyword to try to prevent this:
    __global__ void test_latency(clock_t *time) {
        int start = 0, end = 0;
        volatile int test1 = 0x00C0FFEE;
        volatile int test2 = 0xDEADBEEF;
        asm volatile (
            "mov.u32 %0, %%clock;\n\t"
            "add.s32 %2, %2, %3;\n\t"
            "mov.u32 %1, %%clock;\n\t"
            : "=r"(start)
            , "=r"(end)
            , "+r"(test1)
            , "=r"(test2)
            :: "memory"
        );
        *time = (clock_t)(end - start);
    }
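For context, a minimal host-side harness for launching this kernel might look like the sketch below. The `main` function, the single-thread launch configuration, and the error handling are assumptions added for illustration; only the kernel itself comes from the code above and is repeated unchanged so the file compiles on its own.

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Kernel repeated unchanged from the question.
__global__ void test_latency(clock_t *time) {
    int start = 0, end = 0;
    volatile int test1 = 0x00C0FFEE;
    volatile int test2 = 0xDEADBEEF;
    asm volatile (
        "mov.u32 %0, %%clock;\n\t"
        "add.s32 %2, %2, %3;\n\t"
        "mov.u32 %1, %%clock;\n\t"
        : "=r"(start)
        , "=r"(end)
        , "+r"(test1)
        , "=r"(test2)
        :: "memory"
    );
    *time = (clock_t)(end - start);
}

// Hypothetical harness (not part of the original code): launches one
// thread and prints the clock difference the kernel recorded.
int main() {
    clock_t *d_time = nullptr;
    clock_t h_time = 0;

    cudaError_t err = cudaMalloc(&d_time, sizeof(clock_t));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // A single thread in a single block, so no other warp competes
    // for issue slots while the two clock reads execute.
    test_latency<<<1, 1>>>(d_time);

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_time);
        return 1;
    }

    err = cudaMemcpy(&h_time, d_time, sizeof(clock_t), cudaMemcpyDeviceToHost);
    cudaFree(d_time);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("elapsed clock ticks: %lld\n", (long long)h_time);
    return 0;
}
```

The printed value is only meaningful if the instructions between the two clock reads survive compilation, which is exactly the problem described next.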
However, nvcc still optimizes my code. When I compile it at the default optimization level, the IADD instruction is optimized away completely, leaving me with:
    S2R R2, SR_CLOCKLO;
    S2R R3, SR_CLOCKLO;
which only measures the latency of reading the clock register.
When I disable all optimizations with the -Xptxas -O0 flag, I get:
    S2R R2, SR_CLOCKLO;
    MOV R2, R2;
    MOV R2, R2;
    IADD R0, R0, R0;
    MOV R0, R0;
    S2R R0, SR_CLOCKLO;
    MOV R0, R0;
    MOV R0, R0;
which measures not only the latency of the IADD instruction but also that of the three moves inserted between the two clock reads, despite the volatile keyword.
What I need is something like this:
    S2R R2, SR_CLOCKLO;
    IADD R0, R0, R1;
    S2R R3, SR_CLOCKLO;
where the variables test1 and test2 would sit in R0 and R1, respectively.
So why does my code still get optimized despite the volatile keyword, and how do I stop nvcc from doing that?