How to write a micro-benchmark to test the latency of a specific instruction

Hi, I want to write some code to test the latency of a specific instruction such as add.u32. But the optimizer always transforms the code, and the final SASS code is not exactly what I want. If I disable the optimizer, many MOV instructions are inserted.

For example,
I wrote the device code below:

__global__
void new_u32_add_latency(long long *latency, int *result, int a, int b) {
    long long start, end;
    do {
        __syncwarp(0xffffffff);
        // read clock64, perform one dependent add, read clock64 again
        asm ("mov.u64 %1, %%clock64;\n\t"
             "add.u32 %0, %0, %3;\n\t"
             "mov.u64 %2, %%clock64;\n\t"
             : "+r"(a), "=l"(start), "=l"(end) : "r"(b));
        __syncwarp(0xffffffff);
    } while (start >= end);   // retry until the second reading is past the first
    *latency = end - start;
    *result = a + b;          // keep the result live so the add is not eliminated
}

The output of nvdisasm:

Well, the third parameter of IADD3 comes from constant memory, which is not what I want.
I tried adding a temporary register and manually moving parameter b from constant memory into that register. However, the compiler optimized this out and the third parameter still comes from constant memory.
Later I tried disabling the optimizer, but I still can't get the expected SASS code. Many MOV instructions are inserted between the two clock reads, like below:

How can I “customize” the final SASS code?

Please post code as marked-up text, not as images. Images are neither searchable nor easily accessible to visually impaired people.

The CUDA compiler (and any recent compiler for that matter) performs dead code analysis and removes “dead” code. Any computation that does not contribute to modification of globally visible state (what is in global memory after the kernel ends) might as well not be performed and is thus considered dead.
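To illustrate (a toy example of my own, not code from the thread): in a kernel like the one below, nothing ever reaches global memory, so the whole chain of adds is dead by that definition and the compiler can legitimately delete it.

__global__ void dead_add_example(int a, int b)
{
    int sum = a;
    for (int i = 0; i < 500; i++) {
        sum = sum + b;   // result is never stored anywhere globally visible
    }
    // no write to global memory: 'sum' is discarded, so the adds are "dead"
}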

A standard way of measuring latency is to construct a (usually lengthy) dependency chain, and then write out the result of that dependency chain at the end. In most cases this can be done at the HLL level. One way to measure the latency of ADDs would be to read out a global memory location, say a[thread_idx], then add to it an increment that is not a compile-time constant, say 500 times, using straight-line code, then write the result back to a[thread_idx]. Now measure the elapsed time for doing all these increments and divide by 500.
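A sketch of that recipe, with the kernel name, array layout, and the count of 500 picked purely for illustration:

__global__ void add_latency_chain(long long *cycles, int *a, int inc)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = a[idx];                  // starting value read from global memory
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < 500; i++) {    // unrolls into straight-line code;
        sum = sum + inc;               // each add depends on the previous one
    }
    long long stop = clock64();
    a[idx] = sum;                      // write the chain's result back so it stays live
    if (idx == 0) {
        *cycles = stop - start;        // divide by 500 on the host for per-add latency
    }
}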

Note that the compiler also reorders independent instructions as it sees fit. Simply reading out the clock register is an independent operation. If you need to delay the reading of the clock until the sum is done, you will need to create a dependency. Here, you might check the value of the computed sum and invoke the second clock() call only if that sum satisfies a condition whose outcome the compiler cannot establish at compile time, but which you as the programmer can guarantee is true by construction.
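Building on the sketch above (again my own variant), the guarded second clock read could look like this, assuming the host initializes a[] and inc such that the sum is guaranteed to be non-zero:

__global__ void add_latency_chain_guarded(long long *cycles, int *a, int inc)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = a[idx];
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < 500; i++) {
        sum = sum + inc;
    }
    long long stop = start;
    if (sum != 0) {                    // guaranteed true by construction, but the
        stop = clock64();              // compiler cannot prove it, so this read
    }                                  // cannot be moved ahead of the adds
    a[idx] = sum;
    if (idx == 0) {
        *cycles = stop - start;
    }
}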

Recent GPU architectures are optimized for 3-input operations, meaning 2-input operations are usually mapped to the equivalent 3-input operation where the third input is zero. For the zero operand the compiler typically uses the dedicated zero register, RZ. I am not aware of a way to change that, and in fact I am not even sure instructions like 2-input IADD and 2-input LOP still physically exist on Ampere. You could try reducing the optimization level of the back-end compiler ptxas by specifying -Xptxas -O1, but I don’t think it will make a difference.

The many MOV instructions in your one snippet probably resulted from use of -Xptxas -O0, which is (I think) used for debug builds only. From observation, with that setting the compiler practically performs “pessimization” to allow maximum observability for debuggers. The MOVs are artifacts of that approach.

I know there are some methods to prevent the optimizer from modifying my code, but these methods work only case by case.
Is there any way to tell the optimizer not to change my embedded PTX code, or to write SASS code directly?

Since you posted images, I did not look at your code closely and just addressed typical problems that crop up in such endeavors.

PTX code is compiled into SASS by an optimizing compiler, ptxas. Specifying -Xptxas -O1 is as close as you can get to ptxas not optimizing the code it produces.

There is no public NVIDIA-supported tool for programming in SASS directly. Various people have reverse engineered portions of the instruction set and steering words and written rudimentary SASS assemblers. But since the details change with every GPU architecture, I don’t think any of these efforts were sustained for more than one GPU generation. Various people have published reports about reverse engineering specific GPU microarchitectures; I don’t know whether they made their frameworks publicly available.

In general, for simple instructions like integer add there should be no need to drop down to the assembly language level to measure instruction latency.

In many cases, warp-uniform access to constant memory is as fast as a register access, and especially if there is no data re-use, the compiler may not deem it necessary to pull a kernel argument or literal constant out of constant memory into a register first. The long dependency chain that I suggested will create data re-use and may entice the compiler to pull the increment into a register. Otherwise, try pre-processing the increment for the addition through a math function first so that the compiler cannot use the constant-memory data directly, as-is.
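As a sketch of that last suggestion (the use of max() is an arbitrary choice on my part; any transformation the compiler cannot fold away should have the same effect):

__global__ void add_latency_reg_increment(long long *cycles, int *a, int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int inc = max(b, 1);               // must be computed at run time, so the value
                                       // ends up in a register, not constant memory
    int sum = a[idx];
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < 500; i++) {
        sum = sum + inc;               // third IADD3 operand should now be a register
    }
    long long stop = clock64();
    a[idx] = sum;
    if (idx == 0) {
        *cycles = stop - start;
    }
}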

What are you trying to accomplish by determining the latency of an integer add on the GPU? GPUs are designed as throughput machines, and to first order instruction latencies should not matter. For Volta and Pascal the latencies of various instructions are reported in this publication:

Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza, “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking,” Citadel Technical Report, April 2018.

The same team gave a presentation on the Ampere architecture at GTC 2021 but I cannot find the PDF for that right now. Check NVIDIA’s online archive of GTC presentations (requires developer login, same as for this forum).