fpxa optimization makes memory latency higher?

UPDATE: The “fpxa” in the subject should of course be “ptxas”.

Hello,

Consider the attached program, which measures the latency of reads from global memory. When compiled with

nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v,-O1 --opencc-options=-O0 -o 1w-1l1 1w-1l1.cu

the output is

Reading: 632.75 +/- 2.26

Writing: 1484.98 +/- 5.38

If ptxas is told not to optimize, reading appears to be nearly twice as fast:

nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v,-O0 --opencc-options=-O0 -o 1w-1l1 1w-1l1.cu

Reading: 342.40 +/- 1.62

Writing: 1490.56 +/- 3.79

Running objdump on the cubin files does not reveal any difference that looks significant to me. Around the timed load, the “optimized” version just uses a left shift instead of a multiplication by two:

/*0018*/  /*0x40009c042c000001*/  S2R R2, SR_ClockLo;
/*0020*/  /*0x04209e036000c000*/  SHL R2, R2, 0x1;
/*0028*/  /*0x00001c8580000000*/  LD R0, [R0];
/*0030*/  /*0x00001c05e0000000*/  MEMBAR.CTA;
/*0038*/  /*0x00001c45e0000000*/  MEMBAR.SYS;
/*0040*/  /*0x00000007d0000000*/  BPT.DRAIN 0x0;
/*0048*/  /*0x4000dc042c000001*/  S2R R3, SR_ClockLo;

The unoptimized version, for comparison:

/*0030*/  /*0x40009c042c000001*/  S2R R2, SR_ClockLo;
/*0038*/  /*0x08209c035000c000*/  IMUL.U32.U32 R2, R2, 0x2;
/*0040*/  /*0x00001c8580000000*/  LD R0, [R0];
/*0048*/  /*0x00001c05e0000000*/  MEMBAR.CTA;
/*0050*/  /*0x00001c45e0000000*/  MEMBAR.SYS;
/*0058*/  /*0x00000007d0000000*/  BPT.DRAIN 0x0;
/*0060*/  /*0x4000dc042c000001*/  S2R R3, SR_ClockLo;

Why is the first version so slow?

As a side remark, I’m also curious why the clock value is multiplied by two, and what the effect of the “BPT.DRAIN” instruction is…
1w-1l1.cu (3.06 KB)
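
For readers who cannot open the attachment, a measurement of this kind can be sketched roughly as below. This is not the attached 1w-1l1.cu: the kernel name `read_latency`, the constants `ITERS`, `N` and `STRIDE`, and the pointer-chasing layout are all assumptions made for illustration. It also uses a dependent store to shared memory to make the load complete before the second clock read, rather than the memory fences visible in the disassembly above, so the cycle counts it prints include that store and the overhead of `clock()` itself and need not match the numbers quoted in this post.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// All three constants are arbitrary, hypothetical choices for this sketch.
#define ITERS  256         // timed loads per launch
#define N      (1 << 20)   // chain length in elements
#define STRIDE 1024        // distance in elements between consecutive loads

// Hypothetical kernel: a single thread chases indices through global memory
// and records the clock() difference around each dependent load.
__global__ void read_latency(const unsigned int *chain,
                             unsigned int *cycles, unsigned int *sink)
{
    __shared__ unsigned int s_cycles[ITERS];
    __shared__ unsigned int s_next[ITERS];
    unsigned int j = 0;
    for (int i = 0; i < ITERS; ++i) {
        unsigned int start = (unsigned int)clock();
        j = chain[j];      // the global load being timed
        s_next[i] = j;     // store depends on the loaded value
        unsigned int stop = (unsigned int)clock();
        s_cycles[i] = stop - start;
    }
    // Copy results out so the compiler cannot discard the loads.
    for (int i = 0; i < ITERS; ++i) {
        cycles[i] = s_cycles[i];
        sink[i] = s_next[i];
    }
}

int main()
{
    // Each element holds the index of the next element to read.
    std::vector<unsigned int> h_chain(N);
    for (unsigned int i = 0; i < N; ++i)
        h_chain[i] = (i + STRIDE) % N;

    unsigned int *d_chain, *d_cycles, *d_sink;
    cudaMalloc(&d_chain, N * sizeof(unsigned int));
    cudaMalloc(&d_cycles, ITERS * sizeof(unsigned int));
    cudaMalloc(&d_sink, ITERS * sizeof(unsigned int));
    cudaMemcpy(d_chain, &h_chain[0], N * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    read_latency<<<1, 1>>>(d_chain, d_cycles, d_sink);

    // This blocking copy also waits for the kernel to finish.
    unsigned int h_cycles[ITERS];
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Skip the first iteration, which is likely to include warm-up effects.
    double sum = 0.0;
    for (int i = 1; i < ITERS; ++i)
        sum += h_cycles[i];
    printf("Reading: %.2f cycles per load (mean of %d loads)\n",
           sum / (ITERS - 1), ITERS - 1);

    cudaFree(d_chain);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

Building this sketch with the two command lines above (changing only the file names) and disassembling both cubins should show whether the -O0 versus -O1 difference in ptxas also appears outside the attached program.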