Division Slow Path

The programming guide only mentions a slow path for trigonometric functions. However, nvcc seems to generate a slow-path division function for a kernel of mine, named __cuda_sm3x_div_rn_noftz_f32_slowpath.
When profiling the kernel with Nsight Compute, I noticed that a significant amount of execution time is spent in this function. Assuming that the slow path is only taken for arguments large in magnitude, this result is unexpected, since I only divide by values x with 0.01f <= x <= 105615.0f. Replacing all / operators with __fdividef eliminates this issue and speeds up execution threefold.
What can be done to avoid the slow path division? Is my solution viable when I want to retain accuracy?

The symbol name suggests that this is code for compute capability 3.x. Is your target GPU actually a GPU with compute capability 3.x? If not, you would want to specify the correct target architecture on the nvcc command line.

I compiled a sample single-precision division for sm_30, and it seems (I have not fully annotated the disassembly, as it is quite tedious) the division slowpath is taken for overflow and underflow cases. Given your range of divisors, it seems likely that you are hitting the underflow case, i.e. many of your dividends are already very small and produce subnormal quotients.

Even without hitting the slowpath, an IEEE-754 compliant floating division is not going to be as fast as the approximate division __fdividef(). The single-precision division fastpath for sm_30 is a called subroutine of 17 instructions, while even with -ftz=false, __fdividef() results in 5 inlined instructions.

I think I have noted before that for optimal performance NVIDIA should look into inlining the single-precision division fastpath, leaving just the slowpath as a called subroutine. Ah, here:


This section of the programming guide seems useful.

Probably the first question would be what accuracy do you need in your single-precision floating point division operations? Do you need 0 ulp error, or can you tolerate up to 2 ulp error?

If we assume your question implies that you need the 0 ulp error available from the IEEE-754 compliant / operation (with --prec-div unspecified, or true) then no, your solution is not viable to retain accuracy.

If you can tolerate up to 2 ulp error as compared to the IEEE-754 compliant / operation (and subject to additional questions/conditions), then there are subsequent questions that need to be asked/answered. I believe these can be teased out of the documentation, and the divisor range you have given is an important point.


Visual Studio is compiling with -gencode=arch=compute_75,code=\"sm_75,compute_75\". The first lines of the PTX file are:

.version 7.5
.target sm_75, debug

I’m not sure why the generated function has sm3x in its name.

I was thinking we only use the slow path when the divisor under- or overflows. The dividend can actually go arbitrarily close to 0. Will this also result in slow path division?

Ideally I would need 0 ulp without slow path, e.g., rounding to zero like __fdividef when 2^126 < |y| < 2^128 or when x is very small (assuming my previous question is answered with YES).

One possible reason is that the library implementers have chosen to use that routine even for higher compute capabilities. It’s not unheard of. The sm3x function may have been found suitable according to some criteria for your compile settings.

Then __fdividef is not an option. I don’t think it’s possible to have any meaningful impact on the selection of the fast path vs. the slow path except by studying your inputs, and even then I’m not sure what purpose it would serve (based on the inputs you don’t do the division, or something? It’s unclear to me what the idea is here.)

If you can tolerate denormals being flushed to zero, you might try --ftz=true to see if it has any perf impact.


Your previous post was in fact exceedingly clear. Since most of your divisors are much larger than unity, in most cases the magnitude of the quotient will be smaller than the magnitude of the dividend. If the dividends are arbitrarily close to 0, the quotients are thus arbitrarily closer to 0, meaning there is a good chance that some (or even many) of them will be subnormals, that is, |quotient| < 2^-126. When that happens, the slowpath is invoked, at least for compute capability 3.x.

You seem to be under the mistaken impression that the magnitude of the divisor controls whether the slowpath is invoked. But looking at the SASS generated by the compiler for compute capability 3.x, it seems quite clear that the slowpath is triggered by an overflow / underflow condition in the quotient.

As for the symbol name: The slowpath code may have been originally written for compute capability 3.x and may have been retained for later architectures, preserving the name. With newer architectures, the special case handling in single-precision division is controlled by the FCHK machine instruction, which takes both dividend and divisor as source operands and produces a predicate. It is neither publicly documented nor accessible from PTX, so I cannot tell (without much reverse engineering work) when it triggers the slowpath. The above analysis therefore may or may not be applicable to compute capabilities > 3.x.

You could try compiling with -ftz=true to flush subnormals to zero, to see whether that provides any speedup. Obviously that flag will affect all computation in the compilation unit, which may or may not result in functionally correct behavior of the code. It depends on what that computation is doing.


Thank you for your replies, you have helped my understanding tremendously.
I will come back with results once I have tried flushing the subnormals.

Unfortunately, ftz=true does not result in the performance improvements I hoped for.
I have tried the code from the post @njuffa linked earlier.
However, I have replaced the slow path invocation with return 0.f. The kernel performs almost as fast as with __fdividef.
Would fp32_div(x,y) now result in what I’m expecting, i.e., compute the quotient with ulp 0 and flush subnormal quotients to zero?

Complete code for reference:

/* Subnormal arguments and results are treated as zero */
__forceinline__ __device__ float rcp_approx_gpu(float divisor)
{
	float r;
	asm("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
	return r;
}

__forceinline__ __device__ float fp32_div(float dividend, float divisor)
{
	const unsigned int FP32_MANT_MASK = 0x007fffffu;
	const unsigned int FP32_ONE = 0x3f800000u;
	const unsigned int FP32_SIGN_MASK = 0x80000000u;
	const unsigned int FP32_SIGN_EXPO_MASK = 0xff800000u;
	const unsigned int FP32_QUOTIENT_HI_LIMIT = 0x7f7fffffu; // 0x1.0p+128 - ulp
	const unsigned int FP32_QUOTIENT_LO_LIMIT = 0x00800001u; // 0x1.0p-126 + ulp

	// fast path
	float recip_approx = rcp_approx_gpu(divisor);
	float recip_err = fmaf(recip_approx, -divisor, 1.0f);
	float refined_recip = fmaf(recip_approx, recip_err, recip_approx);
	float dividend_mant = __int_as_float((__float_as_int(dividend) & FP32_MANT_MASK) | FP32_ONE);
	float dividend_scale = __int_as_float(__float_as_int(dividend) & FP32_SIGN_EXPO_MASK);
	float refined_quotient_mant = refined_recip * dividend_mant;
	float refined_quotient_residual = fmaf(refined_quotient_mant, -divisor, dividend_mant);
	float final_quotient_mant = fmaf(refined_recip, refined_quotient_residual, refined_quotient_mant);
	float final_quotient = final_quotient_mant * dividend_scale;

	// check whether the slow path would be needed; return zero instead of invoking it
	unsigned int iq = __float_as_int(final_quotient) & ~FP32_SIGN_MASK;
	if ((iq > FP32_QUOTIENT_HI_LIMIT) || (iq < FP32_QUOTIENT_LO_LIMIT))
		return 0.f;
	return final_quotient;
}

Two caveats:

(1) Generally speaking, underflow is not the only condition that triggers invocation of the slowpath code. With your modification, non-underflow special cases would now also return zero.

(2) As I stated in my other post, I have not proven my fastpath code correct (= returns the correctly rounded-to-nearest quotient). I just did enough testing to demonstrate to my satisfaction that it is not broken in major ways. Comparing performance between two pieces of code does not make sense if they do not work in a functionally equivalent way: “If results are allowed to be wrong, they can be delivered arbitrarily fast.”

It is up to you to decide whether the risk due to these caveats is tolerable.

Thank you.
I have determined that the slow path is taken when the dividend is zero. There are no subnormals in my input data. The majority of my input data is zeros, thus division is taking up most of the execution time.
Is this a bug, or is the div operation just not optimized for this special case?

I refer back to my statement that my earlier analysis pertains to sm_3x code. Looking at the disassembly of the relevant SASS code, there is no provision for special case handling a dividend of zero that I can see. But I have not spent the time to work out every last detail. Generally, when one creates a fastpath for the most common case(s), one cannot spend much time on sorting out special cases up front, otherwise the fastpath will no longer be fast. The sm_3x code detects all special cases with a single comparison/branch, and the slowpath code then has to sort out the details. From a quick glance, a zero dividend seems to be handled fairly early in the slowpath (but incurs non-trivial function call overhead to that point).

As I also pointed out previously, the slowpath code for later GPU architectures, including that of your GPU, is invoked based on a predicate computed by the FCHK instruction, which is not publicly documented. One school of thought about checking instructions (I have designed a few myself in the past, in the context of an x87 FPU) is that they do two things: (1) determine whether a special case is present, indicating this by a flag or predicate; (2) supply a result for special cases that do not involve actual computation. Whether FCHK does both of these, I do not know, and have no intention of reverse engineering it. If it does, the case of a dividend of zero should be handled by FCHK itself and thus be fast.

If most of your dividends are zero, you could try using a wrapper function:

__device__ float fp32div_wrapper (float dividend, float divisor)
{
    return ((dividend == 0.0f) && (divisor != 0.0f)) ? 0.0f : (dividend / divisor);
}

Please note that IEEE-754 specifies signed zeros (essential in some contexts), and that the above wrapper replaces that with a canonical zero result, making the resulting division operation non-compliant with IEEE-754. It also returns zero for a division of zero by a NaN divisor, which is likewise non-compliant (NaNs are supposed to propagate through).

As with all performance effects exceeding the level of measurement noise (2%) at application level, you might want to consider filing an enhancement request with NVIDIA. Maybe it is time for a general overhaul of floating-point division to fully optimize it for the latest architectures.