1w-1l1.cu (2.61 KB)
Hello,
Consider the attached (naive) attempt to measure the number of clock cycles required to read/write from global or host mapped memory.
When looking at the PTX file generated with
nvcc -arch=compute_20 -code=sm_20 -G0 -O0 --ptxas-options=-v,-O0 --opencc-options=-O0 --ptx 1w-1l1.cu
the CUDA code
t1 = clock();
tmp = data[threadIdx.x];
is translated to
.loc 16 22 0
mov.u32 %r1, %clock;
mov.s32 %r2, %r1;
mov.s32 %r3, %r2;
.loc 16 23 0
ld.param.u32 %r4, [__cudaparm__Z6kernelPfPiS0__data];
mov.u32 %r5, %tid.x;
mul.lo.u32 %r6, %r5, 4;
add.u32 %r7, %r4, %r6;
ld.global.f32 %f1, [%r7+0];
mov.f32 %f2, %f1;
Why is there so much data duplication? E.g. the clock tick is saved in 3 registers and the loaded data saved in 2 registers.