Examining the generated .ptx file

I have pushed my CUDA C kernel code as far as it can go in terms of obvious optimizations. I generated the .ptx file and have been looking at the CUDA PTX ISA document, which helps me interpret some of the lower-level details.

Is there such a thing as ‘low-hanging fruit’ in terms of obvious candidates for low-level optimizations?

Take the following section as an example:

	mul.wide.s32 	%rd19, %r34, 48;
	mov.u64 	%rd20, _Z11simple_backPKfS0_S0_Pffiifiiii$__cuda_local_var_795740_31_non_const_projMat;
	add.s64 	%rd21, %rd20, %rd19;
	ld.shared.f32 	%f16, [%rd21+32];
	ld.shared.f32 	%f17, [%rd21+36];
	mul.ftz.f32 	%f18, %f17, %f2;
	fma.rn.ftz.f32 	%f19, %f16, %f1, %f18;
	ld.shared.f32 	%f20, [%rd21+40];
	fma.rn.ftz.f32 	%f21, %f20, %f3, %f19;
	ld.shared.f32 	%f22, [%rd21+44];
	add.ftz.f32 	%f23, %f21, %f22;
	ld.shared.f32 	%f24, [%rd21];
	ld.shared.f32 	%f25, [%rd21+4];
	mul.ftz.f32 	%f26, %f25, %f2;
	fma.rn.ftz.f32 	%f27, %f24, %f1, %f26;
	ld.shared.f32 	%f28, [%rd21+8];
	fma.rn.ftz.f32 	%f29, %f28, %f3, %f27;
	ld.shared.f32 	%f30, [%rd21+12];
	add.ftz.f32 	%f31, %f29, %f30;
	div.approx.ftz.f32 	%f32, %f31, %f23;
	cvt.rni.ftz.f32.f32	%f33, %f32;
	cvt.rzi.ftz.s32.f32	%r4, %f33;
	ld.shared.f32 	%f34, [%rd21+16];
	ld.shared.f32 	%f35, [%rd21+20];
	mul.ftz.f32 	%f36, %f35, %f2;
	fma.rn.ftz.f32 	%f37, %f34, %f1, %f36;
	ld.shared.f32 	%f38, [%rd21+24];
	fma.rn.ftz.f32 	%f39, %f38, %f3, %f37;
	ld.shared.f32 	%f40, [%rd21+28];
	add.ftz.f32 	%f41, %f39, %f40;
	div.approx.ftz.f32 	%f42, %f41, %f23;
	cvt.rni.ftz.f32.f32	%f43, %f42;
	cvt.rzi.ftz.s32.f32	%r5, %f43;
	setp.gt.s32	%p5, %r4, 0;
	setp.le.s32	%p6, %r4, %r10;
	and.pred  	%p7, %p5, %p6;
	setp.gt.s32	%p8, %r5, 0;
	and.pred  	%p9, %p7, %p8;
	setp.le.s32	%p10, %r5, %r11;
	and.pred  	%p11, %p9, %p10;
	@!%p11 bra 	BB2_10;
	bra.uni 	BB2_8;

I believe that section maps to some 32-bit multiplications of local registers with shared memory values inside a for loop.

I am not asking for an exact answer; I would just like to get an idea of what to look for in PTX.

Note: this is for Kepler, otherwise I would use maxas (thanks Scott G) to delve into the low-level details.


As someone who has looked at more generated code than most CUDA users, my standing recommendation is to always look at the machine code (SASS) in conjunction with the source code. Looking at PTX code is only worthwhile when tracking compiler code generation issues, and I did that maybe twice a year on average, while I looked at machine code more than once a week.

That said, some observations:

(1) You could potentially share the divisions by computing the reciprocal of the divisor and multiplying by the two different dividends. That is not an identity transformation of course, so you will have to think about it or experiment with it.
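The suggested transformation can be sketched on the host as follows. The names num_x, num_y, and w are hypothetical stand-ins for the two numerators and the shared denominator computed in the kernel (on the device, the division would map to a reciprocal approximation under fast math):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch: two perspective divides sharing one reciprocal.
// Variable names are illustrative, not taken from the original kernel.
void project(float num_x, float num_y, float w, float &x, float &y)
{
    float rw = 1.0f / w;  // one division (one reciprocal on the device)
    x = num_x * rw;       // replaces x = num_x / w
    y = num_y * rw;       // replaces y = num_y / w
}
```

As noted, this is not an identity transformation, since x/w and x*(1/w) can round differently.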

(2) On some GPUs you may be able to go slightly faster by parallelizing the float->int conversions. Right now you have something like f = rintf (quotient); i = (int)f; where the conversion depends on the result of the rounding. You can increase ILP by using f = rintf (quotient); i = __float2int_rn (quotient); instead, making the two operations independent. This transformation assumes no integer overflow occurs in the process.
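A minimal host-side sketch of the two variants. On the device, __float2int_rn() would replace (int)rintf(); plain C code is used here only so the data flow is visible:

```cpp
#include <cassert>
#include <cmath>

// Dependent chain: the int conversion must wait for the result of rintf().
void convert_dependent(float q, float &f, int &i)
{
    f = rintf(q);
    i = (int)f;        // depends on f
}

// Independent: both operations consume q directly and can issue in parallel.
void convert_independent(float q, float &f, int &i)
{
    f = rintf(q);
    i = (int)rintf(q); // __float2int_rn(q) on the device; depends only on q
}
```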

(3) Can the floating-point adds and multiplies somehow be combined into FMA to maximize throughput? This will be much easier to determine at the source level.

(4) It looks like the code is compiled with -use_fast_math?

Yes, I did compile with the -use_fast_math flag.

By using that specific device casting intrinsic, it changed this type of PTX line:

cvt.rzi.ftz.s32.f32	%r4, %f33;

to this:

cvt.rni.ftz.s32.f32	%r4, %f33;

Which is probably what you would expect.

In general I need to experiment, as you suggested. As for SASS, I would like to dig in there, but unfortunately I do not have a great deal of experience with generated machine code. Are there any educational resources that would apply to SASS code generation?

Thanks again!

As far as I am aware NVIDIA only provides a one line description of most SASS instructions. This makes it hard for someone without prior exposure to assembly language programming to follow what is going on. On the other hand, your kernel is quite simple, so it may be a good starting point for getting up to speed on reading SASS.

As for the convert instructions, you should see two independent instructions versus two dependent instructions in your original code, e.g.

cvt.rni.ftz.f32.f32 %f33, %f32; // note both instructions
cvt.rni.ftz.s32.f32 %r4, %f32; // have the same source operand

Yet another thought. Your kernel seems to use shared memory quite extensively. Have you checked, using the profiler, whether the use of shared memory is fully optimized?

Interestingly, I have tried different memory layouts and access patterns for that shared 252-element float array. The current implementation has it in a 3D [21][3][4] form. I did try flattening it to a 1D [252] array, but it did not make much of a difference in running time. I even tried putting it in constant memory, because many thread blocks will share some constant read-only input values, but somehow that was actually slower than just loading into shared memory.

That portion of the code is not the main bottleneck, as the kernel is memory bound. The main problem is that the kernel ends up having to read from somewhat random global memory locations (determined by the values in the input set). Fortunately the writes have a better memory access pattern.

It is a good point about using the PTX generated by this simple kernel as a starting point for learning SASS.

If the global load access pattern is a problem, but there is still reasonable locality, I would suggest investigating the use of textures, or at least trying to maximize the use of LDG instructions.

As a first step, you would want to religiously apply the const modifier to all pointer arguments that are read-only, and the __restrict__ modifier to all pointer arguments that are known not to be aliased. Note that the latter is basically a promise made by the programmer; asserting this property incorrectly could cause the code to fail. See the Programming Guide and the Best Practices Guide. If that fails to generate the desired LDG instructions, you could use the __ldg() intrinsic to force the use of LDG instructions when reading data that is known to be read-only for the entire duration of the kernel.
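A hypothetical signature sketch of the pattern described above (the kernel name, parameters, and body are illustrative only, not the original kernel):

```cuda
// const + __restrict__ lets the compiler route loads through the read-only
// data cache (LDG); __ldg() forces such a load explicitly.
__global__ void simple_back(const float* __restrict__ proj,
                            const float* __restrict__ weights,
                            float*       __restrict__ out,
                            int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float p = __ldg(&proj[idx]);   // explicit LDG for a read-only value
    out[idx] = p * weights[idx];   // may also compile to LDG via const/__restrict__
}
```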

In case the above code performs interpolation (I cannot easily tell without reverse engineering the computation from the PTX instructions), it can be computed with just two FMAs, by rewriting the computation to interpolate between data0 and data1 as follows:

result = data1 * alpha + (-data0 * alpha + data0);


result = fmaf (data1, alpha, fmaf (-data0, alpha, data0));

I was indeed doing exactly that, using const on all read-only pointer arguments, and using __ldg() for that particular load (which does improve performance).

Also thanks to your advice I was able to use __float2int_rn() to eliminate a cvt operation. That coupled with some other adjustments has already improved the running time by about 5%.

I tried that as well, and got a further speedup.

My concern with fmaf() is the rounding aspect. The way it is explained, it sounds like it rounds to the nearest integer value?

Is that correct and what rounding method is used? Round to nearest even?

So if the floating point result is 3.001f, it then rounds to 4.0f? What if the intermediate floating point value is 0.999f ?

fmaf() rounds the result according to the IEEE-754 rounding mode nearest-or-even. In other words, it is equivalent to __fmaf_rn(). The IEEE-754 (2008) standard defines the FMA (= fused multiply-add) operation, and GPU hardware implements the instruction in compliance with the standard. fmaf() is a standard C/C++ function; the CUDA intrinsics __fmaf_r{n,z,u,d} give access to variants of FMA with each of the four IEEE-754 defined rounding modes (nearest-or-even, towards zero, up towards +INF, down towards -INF).

Note that the rounding here has nothing to do with rounding to an integer. The rounding is to the precision of the result operand. As the precision is limited, results of floating-point operations are usually not exactly representable without error. The rounding mode prescribes how the result is mapped to representable floating-point numbers in the vicinity of the true result.

The difference between FMA and the discrete equivalent is as follows:

A sequence of a floating-point multiply followed by an addition will round the result of the multiplication to single precision, forward the rounded result to the addition operation, and round the result of that to single precision as well. That is a total of two roundings. By contrast, FMA multiplies the source operands, retains the fully precise unrounded product, forwards that to the addition operation, and only the final result is rounded to single precision. That is a total of one rounding. So on average, the use of FMA will make the result more accurate, as the number of rounding steps is cut in half. In addition, FMA gives you protection against subtractive cancellation when, in the computation a*b+c, the product a*b is of similar magnitude but opposite sign to c.
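The single rounding of FMA can be demonstrated on the host, assuming the platform's fmaf() is a true fused (correctly rounded) operation. For a = 1 + 2^-23, the exact square a*a = 1 + 2^-22 + 2^-46 does not fit in a float, so the discrete multiply rounds the 2^-46 term away; fmaf() keeps the unrounded product, so fmaf(a, a, -p) recovers exactly that rounding error:

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Returns the rounding error of the discrete product a*a, recovered via FMA.
float fma_residual()
{
    float a = 1.0f + FLT_EPSILON;  // 1 + 2^-23
    float p = a * a;               // rounded product: 1 + 2^-22 (2^-46 lost)
    return fmaf(a, a, -p);         // exact a*a minus p: exactly 2^-46
}
```

Note that p - a * a computed with discrete operations would be exactly zero, since the same rounded product appears on both sides; only the fused operation can see the lost bit.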

A look at NVIDIA’s whitepaper may be helpful if you haven’t read it yet:


Oh Wow, glad I asked…Thank you very much.

So in general, using the CUDA intrinsic functions such as:

int i_x = __float2int_rz(x);

will generally be more efficient than the explicit single cast such as:

int i_x = int(x);


That’s not what I meant to convey. In general I would advocate using standard C/C++ functions and mechanisms to accomplish an objective. Intrinsics are useful for those cases where (a) the functionality is not expressible in ordinary C/C++, (b) the compiler is prevented, by restrictions requiring semantic equivalency for transformations, from making a desired substitution, or (c) the compiler is able to make a specific transformation but it may or may not happen.

Your two examples in #12 are actually equivalent and should cause identical SASS to be generated.

An example of (b) would be the case of a floating-point to integer conversion with rounding mode nearest-or-even. The only C/C++ compliant way I know to express that is: i = (int)rintf(f); Here, rintf() rounds to nearest-or-even, while the (int) type cast is defined to use the rounding mode “towards zero”, that is, truncation. By using the intrinsic __float2int_rn() one can accomplish this in one instruction.
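The portable spelling of "convert to int, rounding to nearest-or-even" looks like this on the host (rintf() honors the current rounding mode, which defaults to nearest-or-even; on the device this compiles to two CVT instructions, which __float2int_rn() collapses into one):

```cpp
#include <cassert>
#include <cmath>

// Convert float to int with round-to-nearest-or-even semantics.
int to_int_rn(float f)
{
    return (int)rintf(f);  // rintf: nearest-or-even; the cast is then exact
}
```

Note the tie-breaking behavior: 3.5f rounds to 4, but 2.5f rounds to 2 (the nearest even value), unlike a plain truncating cast.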

An example of (c) would be use of the __ldg() intrinsic to force an LDG instruction, or use of fmaf() to force use of an FMA operation, rather than separate multiply and add instructions. An example of (a) could be use of the __popc() intrinsic, which gives access to GPU hardware functionality not expressible in C/C++ [unless something like this was added in the 2011 revisions to those ISO standards; I have not actually checked that closely].
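For illustration, this is the kind of loop that __popc() collapses into a single POPC instruction on the GPU (host-side sketch; the function name is made up for this example):

```cpp
#include <cassert>
#include <cstdint>

// Portable population count: number of set bits in a 32-bit word.
int popcount32(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;  // clears the lowest set bit each iteration
        ++n;
    }
    return n;
}
```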

I did understand what you said about the rounding with fmaf(), but brought up that separate issue regarding casting.

Right, that makes sense, thanks.