Instruction reordering and measuring latency of instructions

I am running a very simple code fragment of the following type (on the device) to measure the latency of the ‘add’ instruction:

start_time = clock();

c_ = a_+b_;

end_time = clock();

long num_clocks = end_time-start_time;

I have the following two issues:

  1. When I compile the code without any optimization flags, the end_time clock instruction appears before the add instruction in the PTX. Something like:
mov.u32         %r5, %clock; //start_time

mov.s32         %r6, %r5;

cvt.s64.s32     %rd1, %r6;

mov.u32         %r7, %clock; //end_time

mov.s32         %r8, %r7;

cvt.s64.s32     %rd2, %r8;

add.f32         %f3, %f1, %f2;

I could avoid this with the flag -Xopencc "-O0". Can somebody please explain this behavior, and tell me whether the optimization flag is the only way to force the clock reads to surround the add instruction?

  2. With the -Xopencc "-O0" flag on, the following PTX is generated:
        mov.u32         %r5, %clock; //start_time

        mov.s32         %r6, %r5;

        cvt.s64.s32     %rd1, %r6;

        .loc    16      19      0

        add.f32         %f3, %f1, %f2; //add instruction

        .loc    16      20      0

        mov.u32         %r7, %clock; //end_time

        mov.s32         %r8, %r7;

        cvt.s64.s32     %rd2, %r8;

My concern is that between the two clock reads there are other mov/cvt instructions, which will be counted in my measurement. How do I measure the latency of just the ‘add’ instruction?

Details:

CUDA 4.0 on Linux 64 bit

Compiling with -arch=sm_20

GEFORCE 540M

https://groups.google.com/forum/?nomobile=true#!topic/asfermi/eEjCVpYpZ-s
Look towards the end, and note that all numbers are scheduler clock numbers.
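For what it's worth, one common way around both issues in the question is to time a long chain of dependent adds instead of a single one, and to subtract the cost of an empty timed region. Below is a minimal sketch of that idea, not a definitive method. The names read_clock, time_add_chain, measure_add_latency_cycles and CHAIN_LEN are made up for illustration, and the sketch assumes your toolkit accepts inline PTX. The "volatile" on each asm statement asks the compiler not to delete or move it while generating PTX, but ptxas can still reschedule the final machine code, so it is worth checking the result with cuobjdump -sass. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the cycle counter via inline PTX. "volatile" asks the compiler not to
// delete or move the statement while generating PTX.
__device__ __forceinline__ unsigned int read_clock()
{
    unsigned int t;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(t));
    return t;
}

// One dependent add: acc = acc + inc. The macros repeat it without a loop,
// so no loop counter or branch ends up inside the timed region.
#define ADD1  asm volatile("add.f32 %0, %0, %1;" : "+f"(acc) : "f"(inc));
#define ADD4  ADD1 ADD1 ADD1 ADD1
#define ADD16 ADD4 ADD4 ADD4 ADD4
#define ADD64 ADD16 ADD16 ADD16 ADD16
#define CHAIN_LEN 64  // number of adds in the timed region (illustrative choice)

// cycles[0] = clocks for the 64-add chain, cycles[1] = clocks for an empty region.
__global__ void time_add_chain(const float *in, float *sink, unsigned int *cycles)
{
    float acc = in[0];
    float inc = in[1];

    // Warm-up add: it cannot issue until both loads have completed, so the
    // load latency should not leak into the timed regions below.
    ADD1

    // Empty region: two back-to-back clock reads give the timing overhead.
    unsigned int e0 = read_clock();
    unsigned int e1 = read_clock();

    // Timed region: each add needs the result of the previous one.
    unsigned int t0 = read_clock();
    ADD64
    unsigned int t1 = read_clock();

    sink[0]   = acc;      // keep the chain's result live
    cycles[0] = t1 - t0;
    cycles[1] = e1 - e0;
}

// Host helper: runs one thread and returns approximate clocks per add.
double measure_add_latency_cycles()
{
    const float h_in[2] = {1.0f, 0.5f};
    unsigned int h_cycles[2] = {0, 0};
    float *d_in = 0, *d_sink = 0;
    unsigned int *d_cycles = 0;

    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_sink, sizeof(float));
    cudaMalloc((void **)&d_cycles, sizeof(h_cycles));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    time_add_chain<<<1, 1>>>(d_in, d_sink, d_cycles);
    cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_sink);
    cudaFree(d_cycles);

    return ((double)h_cycles[0] - (double)h_cycles[1]) / CHAIN_LEN;
}

int main()
{
    printf("approx. clocks per dependent add.f32: %.2f\n",
           measure_add_latency_cycles());
    return 0;
}
```

Because every add waits for the previous result, the chain's total time is roughly CHAIN_LEN times the add latency, and the fixed cost of the clock reads is both subtracted and spread over 64 instructions. The last add may still be in flight when the final clock is read, so treat the number as an estimate and compare it against the figures in the thread linked above.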