Why a division by constant is not replaced by a multiplication by constant

Consider the following kernel:

const float DIVISOR = 0.0472440943;
const float VALIDATION_TOLERANCE = 1e-5;

__global__ void kernel(float *input, size_t n) {
  int k = threadIdx.x + blockIdx.x * blockDim.x;
  if (k < n)
    input[k] /= DIVISOR;

I compiled using the following command:
nvcc main.cu --ptx --gpu-architecture=sm_80

PTX code looks like:

// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: CL-32688072
// Cuda compilation tools, release 12.1, V12.1.105
// Based on NVVM 7.0.1

.version 8.1
.target sm_80
.address_size 64

	// .globl	_Z6kernelPfm

.visible .entry _Z6kernelPfm(
	.param .u64 _Z6kernelPfm_param_0,
	.param .u64 _Z6kernelPfm_param_1
	.reg .pred 	%p<2>;
	.reg .f32 	%f<3>;
	.reg .b32 	%r<5>;
	.reg .b64 	%rd<7>;

	ld.param.u64 	%rd2, [_Z6kernelPfm_param_0];
	ld.param.u64 	%rd3, [_Z6kernelPfm_param_1];
	mov.u32 	%r1, %tid.x;
	mov.u32 	%r2, %ntid.x;
	mov.u32 	%r3, %ctaid.x;
	mad.lo.s32 	%r4, %r3, %r2, %r1;
	cvt.s64.s32 	%rd1, %r4;
	setp.ge.u64 	%p1, %rd1, %rd3;
	@%p1 bra 	$L__BB0_2;

	cvta.to.global.u64 	%rd4, %rd2;
	shl.b64 	%rd5, %rd1, 2;
	add.s64 	%rd6, %rd4, %rd5;
	ld.global.f32 	%f1, [%rd6];
	div.rn.f32 	%f2, %f1, 0f3D418306;
	st.global.f32 	[%rd6], %f2;



I wonder why the compiler did not optimize the division by multiplication with a reciprocal of the divisor.


1 Like

To start, please read the article I linked above. In short, the compiler does not do this in the general case because the optimization does not produce a bitwise identical result, and for basic arithmetic operations +, -, x, /, CUDA claims bitwise compliance to a proper IEEE-754 result for floating point arithmetic: 1 2 3

For the particular example you have shown, you can “enable” the compiler to make such an optimization, one possible method is use of --use_fast_math compiler switch to nvcc.

In general, its unwise in my opinion to attempt to obtain the best understanding of what is going on by using PTX. Instead, studying the SASS gives a better view. The process of converting PTX->SASS goes thru an optimizing compiler stage.

Therefore the effect of the above switch will be “less” evident at the PTX level, and “more” evident at the SASS level.

The switch results in producing PTX for your example that includes this alternate instruction:

div.approx.ftz.f32 ...

Studying the SASS, however, we see that the division routine has been replaced with a single multiply instruction:

    /*00c0*/                   FMUL.FTZ R5, R0, 21.166666030883789062 ;                 /* 0x41a9555500057820 */

where 1/0.0472440943 ~= 21.166666…

1 Like

Thanks for the quick and detailed answers.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.