The following .cu code
extern "C" __global__ void XXX(volatile int *device1,
volatile int *device2){
clock_t t1, t2;
unsigned int tmp;
asm volatile ("mov.u32 %0, %%clock;\n\t"
"ld.global.u32 %2, [%3];\n\t"
"add.u32 %2, %2, 2;\n\t"
"mov.u32 %1, %%clock;\n\t"
: "=r"(t1), "=r"(t2), "=r"(tmp)
: "r"(device1)
: "memory");
device1[42] = t2 - t1;
}
is translated to
nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v,-O0 --opencc-options=-O0 --cubin test.cu
cuobjdump --dump-sass test.cubin
code for sm_20
Function : XXX
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x40001c042c000001*/ S2R R0, SR_ClockLo;
/*0010*/ /*0x08001c035000c000*/ IMUL.U32.U32 R0, R0, 0x2;
/*0018*/ /*0x40009c042c000001*/ S2R R2, SR_ClockLo;
/*0020*/ /*0x08209c035000c000*/ IMUL.U32.U32 R2, R2, 0x2;
/*0028*/ /*0x00201d0348000000*/ IADD R0, R2, -R0;
/*0030*/ /*0x80009de218000000*/ MOV32I R2, 0x20;
/*0038*/ /*0x00209c8614000000*/ LDC R2, c [0x0] [R2];
/*0040*/ /*0xa0209c034800c002*/ IADD R2, R2, 0xa8;
/*0048*/ /*0x00201f8590000000*/ ST.WT [R2], R0;
/*0050*/ /*0x00001de780000000*/ EXIT;
So operations are reordered despite disabled optimization and volatile declaration. Instead of measuring the clock cyles between loading from device memory, the program measures the time required for a multiplication.
How can I convince ptxas to not reorder this code, but to translate it as-is, line-by-line?
nvcc --version
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2011 NVIDIA Corporation
Built on Sun_Mar_20_16:47:09_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221