I am running the following very simple code fragment (on the device) to measure the latency of the ‘add’ instruction:
start_time = clock();
c_ = a_ + b_;
end_time = clock();
long num_clocks = end_time - start_time;
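For context, this fragment sits inside a kernel roughly like the one below (the kernel name, the pointer parameters, and the way I write the result back are just placeholders for this post, not my exact code):

__global__ void add_latency(const float *a, const float *b, float *c, long *cycles)
{
    float a_ = a[0];
    float b_ = b[0];
    float c_;
    clock_t start_time = clock();
    c_ = a_ + b_;                            // instruction being timed
    clock_t end_time = clock();
    c[0] = c_;                               // write the result out so the add is not optimized away
    cycles[0] = (long)(end_time - start_time);
}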
I have the following two issues:
- When I run the code without any optimization flags, the end_time clock instruction occurs before the add instruction in the PTX. Something like:
mov.u32 %r5, %clock; //start_time
mov.s32 %r6, %r5;
cvt.s64.s32 %rd1, %r6;
mov.u32 %r7, %clock; //end_time
mov.s32 %r8, %r7;
cvt.s64.s32 %rd2, %r8;
add.f32 %f3, %f1, %f2;
I can avoid this with the flag -Xopencc "-O0". Can somebody please explain this behavior to me, and whether the optimization flag is the only way to force the clock reads to surround the add instruction?
- With the optimization flag on, the following PTX is generated:
mov.u32 %r5, %clock; //start_time
mov.s32 %r6, %r5;
cvt.s64.s32 %rd1, %r6;
.loc 16 19 0
add.f32 %f3, %f1, %f2; //add instruction
.loc 16 20 0
mov.u32 %r7, %clock; //end_time
mov.s32 %r8, %r7;
cvt.s64.s32 %rd2, %r8;
My concern is that the mov/cvt instructions between the two clock reads will also be counted, inflating the latency I compute for the ‘add’ instruction. How do I measure the latency of just the ‘add’ instruction?
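The only workaround I can think of so far is to measure the overhead of an empty timed region (two back-to-back clock reads) separately and subtract it, roughly as sketched below. I am not sure this is a valid way to isolate the add, so please correct me if it is not:

clock_t t0 = clock();
clock_t t1 = clock();
long overhead = t1 - t0;                     // cost of the timing code itself

clock_t start_time = clock();
c_ = a_ + b_;
clock_t end_time = clock();
long num_clocks = (end_time - start_time) - overhead;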
Details:
CUDA 4.0 on 64-bit Linux
Compiling with -arch=sm_20
GeForce 540M
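For completeness, the build commands I use are roughly the following (the source file name is just a placeholder):

nvcc -arch=sm_20 -ptx timing.cu                     # default optimization
nvcc -arch=sm_20 -Xopencc "-O0" -ptx timing.cu      # frontend optimization disabled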