UPDATE: The “fpxa” in the subject should of course be “ptxas”.
Hello,
Consider the attached program, which measures the latency of reading from and writing to global memory. When compiled with
nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v,-O1 --opencc-options=-O0 -o 1w-1l1 1w-1l1.cu
the output is
Reading: 632.75 +/- 2.26
Writing: 1484.98 +/- 5.38
If ptxas is told not to optimize (-O0 instead of -O1), reading appears to be almost twice as fast, while writing is essentially unchanged:
nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v,-O0 --opencc-options=-O0 -o 1w-1l1 1w-1l1.cu
Reading: 342.40 +/- 1.62
Writing: 1490.56 +/- 3.79
Running objdump on the cubin files does not reveal any difference that looks significant to me. Around the timed load, the “optimized” version just uses a left shift instead of a multiplication by two:
/*0018*/ /*0x40009c042c000001*/ S2R R2, SR_ClockLo;
/*0020*/ /*0x04209e036000c000*/ SHL R2, R2, 0x1;
/*0028*/ /*0x00001c8580000000*/ LD R0, [R0];
/*0030*/ /*0x00001c05e0000000*/ MEMBAR.CTA;
/*0038*/ /*0x00001c45e0000000*/ MEMBAR.SYS;
/*0040*/ /*0x00000007d0000000*/ BPT.DRAIN 0x0;
/*0048*/ /*0x4000dc042c000001*/ S2R R3, SR_ClockLo;
while the unoptimized version contains:
/*0030*/ /*0x40009c042c000001*/ S2R R2, SR_ClockLo;
/*0038*/ /*0x08209c035000c000*/ IMUL.U32.U32 R2, R2, 0x2;
/*0040*/ /*0x00001c8580000000*/ LD R0, [R0];
/*0048*/ /*0x00001c05e0000000*/ MEMBAR.CTA;
/*0050*/ /*0x00001c45e0000000*/ MEMBAR.SYS;
/*0058*/ /*0x00000007d0000000*/ BPT.DRAIN 0x0;
/*0060*/ /*0x4000dc042c000001*/ S2R R3, SR_ClockLo;
Why is the first (optimized) version so much slower at reading?
As a side note, I’m also curious why the clock value is multiplied by two and what “BPT.DRAIN” actually does.
1w-1l1.cu (3.06 KB)
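For readers who cannot open the attachment, here is a minimal sketch of this kind of measurement. It is not the attached 1w-1l1.cu: the kernel name time_read, the use of clock(), the two fence calls (guessed from the MEMBAR.CTA and MEMBAR.SYS instructions in the listings), and the reading of "+/-" as a sample standard deviation are all assumptions made for illustration. Whether the fences really force the second clock read to wait for the load is also an assumption.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel (not the attached file): times a single global load.
__global__ void time_read(const unsigned *src, unsigned *sink, unsigned *cycles)
{
    unsigned start = (unsigned)clock();  // read the per-SM cycle counter
    unsigned value = *src;               // the global load being timed
    __threadfence_block();               // assumed source of MEMBAR.CTA
    __threadfence_system();              // assumed source of MEMBAR.SYS
    unsigned stop = (unsigned)clock();
    *sink = value;                       // keep the load from being optimized away
    *cycles = stop - start;              // unsigned subtraction tolerates wraparound
}

int main()
{
    const int runs = 100;                // number of samples, chosen arbitrarily
    unsigned *src, *sink, *cycles;
    cudaMalloc(&src, sizeof(unsigned));
    cudaMalloc(&sink, sizeof(unsigned));
    cudaMalloc(&cycles, sizeof(unsigned));
    cudaMemset(src, 0, sizeof(unsigned));

    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < runs; ++i) {
        time_read<<<1, 1>>>(src, sink, cycles);   // one thread, one warp
        unsigned c = 0;
        // A blocking copy on the default stream waits for the kernel to finish.
        cudaMemcpy(&c, cycles, sizeof(c), cudaMemcpyDeviceToHost);
        sum += c;
        sum_sq += (double)c * c;
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Mean and sample standard deviation of the measured cycle counts.
    double mean = sum / runs;
    double var = (sum_sq - runs * mean * mean) / (runs - 1);
    printf("Reading: %.2f +/- %.2f\n", mean, sqrt(var > 0.0 ? var : 0.0));

    cudaFree(src);
    cudaFree(sink);
    cudaFree(cycles);
    return 0;
}
```

Building this sketch with the two nvcc command lines above and comparing the disassembly would show whether the -O0/-O1 difference reproduces outside the attached program.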