Strange PTX Output

Hi,

While trying to figure out how best to optimise some of my kernels, I looked at the generated PTX, and it appears to contain redundant extra instructions. For example, the source line:

register float subi = (float)tid / 512.0f;

gets turned into

	.loc 14 143 1
	cvt.rn.f32.s32 	%f41, %r1;
	.loc 14 143 1
	cvt.f64.f32 	%fd1, %f41;
	.loc 14 143 1
	cvt.rn.f32.f64 	%f42, %fd1;
	.loc 14 143 1
	bra.uni	tmp452;
tmp452:
tmp453:
	mov.f32 	%f43, 0f44000000;
	mov.f32 	%f44, %f43;
tmp454:
tmp455:
	.loc 7 1311 3
	div.rn.f32 	%f45, %f42, %f44;

Register %f41 doesn’t get used anywhere else, so the conversion to float64 and back seems unnecessary; surely it could just convert %r1 from s32 to f32 and write straight into %f42. The same goes for the extra move copying register %f43 into %f44. Am I missing something, or am I worrying about something that ptxas will optimise away?

Cheers,

Tiomat

Yes. Look at the ptxas output (using cuobjdump -sass), not the PTX. It is very different.

Cheers for the tip, but it seems the sass has the weird convert too:

	/*00e8*/     /*0x59201e0418000000*/ 	I2F.F32.S32 R0, R22;
	/*00f0*/     /*0x01311c0410000000*/ 	F2F.F64.F32 R4, R0;
	/*00f8*/     /*0x11a01c0410000000*/ 	F2F.F32.F64 R0, R4;

I doubt this is my biggest performance limitation; it was just one of those things that seemed a little odd and jumped out at me. I’ll chalk it up to a quirk of the compiler and get back to the real work of finding optimisations that will make a significant improvement.

I agree this does look odd in that the intermediate conversion to double looks unwarranted and superfluous. Are you seeing this with the CUDA 4.2 toolchain, or the CUDA 5.0 preview?

This is currently with 4.1; however, I will try building it with 4.2 and the 5.0 preview when I get a chance, to see if the problem still exists.

Update: 4.2 still has the same problem; I will try the 5.0 preview sometime later.

Cheers,

I am unable to reproduce the reported behavior with CUDA 4.2. My test program is as follows:

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (float *res)
{
    int tid = threadIdx.x;
    register float subi = (float)tid / 512.0f;
    *res = subi;
}

int main (void)
{
    float *res_d = 0;
    cudaMalloc ((void**)&res_d, sizeof(*res_d));  // room for one float
    kernel<<<1,1>>>(res_d);
    return EXIT_SUCCESS;
}

I compiled with

nvcc -arch=sm_20 -keep -o test_conv test_conv.cu

The resulting machine code disassembly is as follows:

code for sm_20
                Function : _Z6kernelPf
        /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
        /*0008*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
        /*0010*/     /*0x00015de219100000*/     MOV32I R5, 0x44000000;
        /*0018*/     /*0x80009de428004000*/     MOV R2, c [0x0] [0x20];
        /*0020*/     /*0x9000dde428004000*/     MOV R3, c [0x0] [0x24];
        /*0028*/     /*0x01211e0418000000*/     I2F.F32.S32 R4, R0;
        /*0030*/     /*0x4001000750000000*/     CAL 0x48;               // this is the float division
        /*0038*/     /*0x00211c8594000000*/     ST.E [R2], R4;
        /*0040*/     /*0x00001de780000000*/     EXIT;

What is the smallest self-contained repro code that you have found that exhibits the suboptimal code generation?

I found this post whilst chasing the exact same issue: my single-precision-only kernel was spending a huge amount of time doing float <-> double conversions.

It turns out all of these conversions were eliminated by ptxas when I removed debugging information from my CUDA build. I suspect it’s something to do with the build setting the --dont-merge-basicblocks flag for debug builds, but I haven’t investigated further.

I still don’t understand why they’re there in the first place, but they can easily be removed from final code by disabling debugging.

In debug builds, all compiler optimizations are disabled, and the compiler reverts to basically a literal rendering of the source code into machine code. This also means that any type conversions that are mandated by abstract C++ semantics will be retained in the code. As far as I am aware, this needs to be done for at least the following two reasons:

(1) Every instruction must be identifiably associated with a particular line in the source code, so code breakpoints can be applied correctly.

(2) Every variable must be examinable by the programmer at any time during its lifetime as determined by HLL semantics.

There is really no point in looking at the PTX or SASS code generated by debug builds. Even for release builds, there is little point in looking at generated PTX code, as this is further compiled (not assembled) to SASS code. The SASS code embedded in an executable, which can be dumped by cuobjdump --dump-sass, is what is worth looking at from a performance perspective.

Thanks for the info. C++ compilers usually differentiate between “debugging information” (such as line numbers, symbol information, etc.) and optimization. I think most people set up a “Debug” build to have “debugging information and no optimization”, while a Release build flips both settings.

It’s possible to create a “Debug Release” build in C++ that has both debugging information and optimizations enabled, which can be very useful for profiling (since the debugging information doesn’t interfere with optimization) and debugging Release build bugs.

From what you’re saying, it sounds like that’s not possible in CUDA, which is good to know. It would still be useful to see roughly which source code line maps to which SASS line in optimized builds, though.

I have analyzed a lot of SASS code over the years and in general it is close to impossible to map back individual SASS instructions to a source line in a release build. Functions often get inlined, branches disappear through if-conversion causing basic blocks to be merged, common sub-expressions are computed into temporary variables, induction variables get created for index arithmetic in loops, and finally SASS instructions get scheduled to increase latency tolerance, often moving them over large distances (tens of instructions).

As a consequence, some source code lines may result in no SASS code whatsoever, while SASS instructions loosely representing other source code lines are spread out all over the place. Some variables that exist at source code level may cease to exist at SASS level, having been completely incorporated into common sub-expressions or compiler-generated induction variables.