Strange PTX Output

Hi,

While trying to figure out how best to optimise some of my kernels, I looked at the generated PTX, and it appears to contain redundant extra instructions. For example, the source line:

register float subi = (float)tid / 512.0f;

gets turned into

	.loc 14 143 1
	cvt.rn.f32.s32 	%f41, %r1;
	.loc 14 143 1
	cvt.f64.f32 	%fd1, %f41;
	.loc 14 143 1
	cvt.rn.f32.f64 	%f42, %fd1;
	.loc 14 143 1
	bra.uni	tmp452;
tmp452:
tmp453:
	mov.f32 	%f43, 0f44000000;
	mov.f32 	%f44, %f43;
tmp454:
tmp455:
	.loc 7 1311 3
	div.rn.f32 	%f45, %f42, %f44;

Register %f41 doesn’t get used anywhere else, so the conversion to float64 and back seems unnecessary; surely it could just convert %r1 from s32 to f32 and write straight into %f42. The same goes for the extra move copying register %f43 into %f44. Am I missing something, or am I worrying about something that ptxas will optimise away?

Cheers,

Tiomat

Yes. Look at the ptxas output (using cuobjdump -sass), not the PTX. It is very different.

Cheers for the tip, but it seems the sass has the weird convert too:

	/*00e8*/     /*0x59201e0418000000*/ 	I2F.F32.S32 R0, R22;
	/*00f0*/     /*0x01311c0410000000*/ 	F2F.F64.F32 R4, R0;
	/*00f8*/     /*0x11a01c0410000000*/ 	F2F.F32.F64 R0, R4;

I doubt this is my biggest performance limitation; it was just one of those things that seemed a little odd and jumped out at me. I’ll chalk it up to a quirk of the compiler and get back to the real work of finding optimisations that will make a significant improvement.

I agree this does look odd in that the intermediate conversion to double looks unwarranted and superfluous. Are you seeing this with the CUDA 4.2 toolchain, or the CUDA 5.0 preview?

This is currently with 4.1; however, I will try building it with 4.2 and the 5.0 preview when I get a chance, to see if the problem still exists.

Update: 4.2 still has the same problem; I will try the 5.0 preview sometime later.

Cheers,

I am unable to reproduce the reported behavior with CUDA 4.2. My test program is as follows:

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (float *res)
{
    int tid = threadIdx.x;
    register float subi = (float)tid / 512.0f;
    *res = subi;
}

int main (void)
{
    float *res_d = 0;
    cudaMalloc ((void**)&res_d, sizeof(*res_d));  // room for one float
    kernel<<<1,1>>>(res_d);
    return EXIT_SUCCESS;
}

I compiled with

nvcc -arch=sm_20 -keep -o test_conv test_conv.cu

The resulting machine code disassembly is as follows:

code for sm_20
                Function : _Z6kernelPf
        /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
        /*0008*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
        /*0010*/     /*0x00015de219100000*/     MOV32I R5, 0x44000000;
        /*0018*/     /*0x80009de428004000*/     MOV R2, c [0x0] [0x20];
        /*0020*/     /*0x9000dde428004000*/     MOV R3, c [0x0] [0x24];
        /*0028*/     /*0x01211e0418000000*/     I2F.F32.S32 R4, R0;
        /*0030*/     /*0x4001000750000000*/     CAL 0x48;               // this is the float division
        /*0038*/     /*0x00211c8594000000*/     ST.E [R2], R4;
        /*0040*/     /*0x00001de780000000*/     EXIT;

What is the smallest self-contained repro code that you have found that exhibits the suboptimal code generation?

I found this post whilst chasing the exact same issue: my single-precision-only kernel was spending a huge amount of time doing float <-> double conversions.

It turns out all of these conversions were eliminated by ptxas when I removed debugging information from my CUDA build. I suspect it’s something to do with the build setting the --dont-merge-basicblocks flag for debug builds, but I haven’t investigated further.

I still don’t understand why they’re there in the first place, but they can easily be removed from final code by disabling debugging.

In debug builds, all compiler optimizations are disabled, and the compiler reverts to basically a literal rendering of the source code into machine code. This also means that any type conversions that are mandated by abstract C++ semantics will be retained in the code. As far as I am aware, this needs to be done for at least the following two reasons:

(1) Every instruction must be identifiably associated with a particular line in the source code, so code breakpoints can be applied correctly.

(2) Every variable must be examinable by the programmer at any time during its lifetime as determined by HLL semantics.

There is really no point in looking at the PTX or SASS code generated by debug builds. Even for release builds, there is little point in looking at generated PTX code, as this is further compiled (not assembled) to SASS code. The SASS code embedded in an executable, which can be dumped by cuobjdump --dump-sass, is what is worth looking at from a performance perspective.

Thanks for the info. C++ compilers usually differentiate between “debugging information” (such as line numbers, symbol information, etc.) and optimization. I think most people set up a “Debug” build to have “debugging information and no optimization”, while a Release build flips both settings.

It’s possible to create a “Debug Release” build in C++ that has both debugging information and optimizations enabled, which can be very useful for profiling (since the debugging information doesn’t interfere with optimization) and debugging Release build bugs.

From what you’re saying, it sounds like that’s not possible in CUDA, which is good to know. It would still be useful to see roughly which source code line maps to which SASS line in optimized builds, though.

I have analyzed a lot of SASS code over the years and in general it is close to impossible to map back individual SASS instructions to a source line in a release build. Functions often get inlined, branches disappear through if-conversion causing basic blocks to be merged, common sub-expressions are computed into temporary variables, induction variables get created for index arithmetic in loops, and finally SASS instructions get scheduled to increase latency tolerance, often moving them over large distances (tens of instructions).

As a consequence, some source code lines may result in no SASS code whatsoever, while SASS instructions loosely representing other source code lines are spread out all over the place. Some variables that exist at source code level may cease to exist at SASS level, having been completely incorporated into common sub-expressions or compiler-generated induction variables.