How to prevent ptxas from reordering code

The following .cu code

extern "C" __global__ void XXX(volatile int *device1,

 	volatile int *device2){

	clock_t t1, t2;

	unsigned int tmp;

	asm volatile ("mov.u32 %0, %%clock;\n\t"

 	"ld.global.u32 %2, [%3];\n\t"

 	"add.u32 %2, %2, 2;\n\t"

 	"mov.u32 %1, %%clock;\n\t"

 	: "=r"(t1), "=r"(t2), "=r"(tmp)

 	: "r"(device1)

 	: "memory");

	device1[42] = t2 - t1;

}

is translated to

nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v,-O0 --opencc-options=-O0 --cubin test.cu

cuobjdump --dump-sass test.cubin

	code for sm_20

		Function : XXX

	/*0000*/ 	/*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/ 	/*0x40001c042c000001*/ 	S2R R0, SR_ClockLo;

	/*0010*/ 	/*0x08001c035000c000*/ 	IMUL.U32.U32 R0, R0, 0x2;

	/*0018*/ 	/*0x40009c042c000001*/ 	S2R R2, SR_ClockLo;

	/*0020*/ 	/*0x08209c035000c000*/ 	IMUL.U32.U32 R2, R2, 0x2;

	/*0028*/ 	/*0x00201d0348000000*/ 	IADD R0, R2, -R0;

	/*0030*/ 	/*0x80009de218000000*/ 	MOV32I R2, 0x20;

	/*0038*/ 	/*0x00209c8614000000*/ 	LDC R2, c [0x0] [R2];

	/*0040*/ 	/*0xa0209c034800c002*/ 	IADD R2, R2, 0xa8;

	/*0048*/ 	/*0x00201f8590000000*/ 	ST.WT [R2], R0;

	/*0050*/ 	/*0x00001de780000000*/ 	EXIT;

So operations are reordered despite disabled optimization and volatile declaration. Instead of measuring the clock cyles between loading from device memory, the program measures the time required for a multiplication.

How can I convince ptxas to not reorder this code, but to translate it as-is, line-by-line?

nvcc --version

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2011 NVIDIA Corporation

Built on Sun_Mar_20_16:47:09_PDT_2011

Cuda compilation tools, release 4.0, V0.2.1221