Understanding Generated PTX code

1w-1l1.cu (2.61 KB)


Consider the attached (naive) attempt to measure the number of clock cycles required to read from or write to global or host-mapped memory.
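For readers without the attachment, the timed section presumably looks something like the following minimal sketch. The parameter names besides `data` are guesses; the mangled symbol in the PTX below (`_Z6kernelPfPiS0_`) suggests a signature of `kernel(float*, int*, int*)` with `data` as the first parameter.

```
// Sketch only: a naive clock()-based timing kernel, assuming a
// kernel(float*, int*, int*) signature. t_start/t_end names are invented.
__global__ void kernel(float *data, int *t_start, int *t_end)
{
    int t1, t2;
    float tmp;

    t1 = clock();              // read the per-SM cycle counter
    tmp = data[threadIdx.x];   // the timed global-memory load
    t2 = clock();

    t_start[threadIdx.x] = t1;
    t_end[threadIdx.x]   = t2;
    data[threadIdx.x] = tmp;   // keep the load from being optimized away
}
```

Note that the write-back of `tmp` (or some other side effect) is needed so the compiler cannot eliminate the load being measured.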

When looking at the PTX file generated with

nvcc -arch=compute_20 -code=sm_20 -G0 -O0 --ptxas-options=-v,-O0 --opencc-options=-O0 --ptx 1w-1l1.cu

the CUDA code

t1 = clock();
tmp = data[threadIdx.x];

is translated to

.loc	16	22	0
mov.u32 	%r1, %clock;
mov.s32 	%r2, %r1;
mov.s32 	%r3, %r2;
.loc	16	23	0
ld.param.u32	%r4, [__cudaparm__Z6kernelPfPiS0__data];
mov.u32 	%r5, %tid.x;
mul.lo.u32 	%r6, %r5, 4;
add.u32 	%r7, %r4, %r6;
ld.global.f32	%f1, [%r7+0];
mov.f32 	%f2, %f1;

Why is there so much data duplication? For example, the clock tick is copied through three registers and the loaded value through two.

PTX is not the actual code that will run on the GPU; it is intermediate code that ptxas optimizes afterwards, which is when this redundant register-to-register copying gets eliminated. To see the real machine code, generate a .cubin (e.g. with --keep) and disassemble it, either with decuda or with the tool that NVIDIA provides. For sm_20, however, you need the NVIDIA tool that ships with the 4.0 RC toolkit, as far as I know. So generate an sm_13 cubin and disassemble that.
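Concretely, that workflow might look like the following (a sketch: the decuda invocation is from memory, and cuobjdump is the NVIDIA disassembler that ships with the 4.0 toolkit):

```
# Produce an sm_13 cubin for the attached source
nvcc -arch=sm_13 --cubin 1w-1l1.cu

# Disassemble with decuda (third-party) ...
decuda 1w-1l1.cubin

# ... or with NVIDIA's cuobjdump (CUDA 4.0 RC and later)
cuobjdump -sass 1w-1l1.cubin
```

In the disassembled SASS you should see the chain of mov instructions collapsed into a single register per value.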