How is FP64 division implemented

I am trying to understand how division is implemented by reading the sass code for the following simple kernel

__global__ void SI(double* arrIn,
                   double* arrOut,
                   int N)
    unsigned int pid = threadIdx.x + blockIdx.x * blockDim.x;
    double x0 = arrIn[pid];
    double y0 = arrIn[pid + N];
    double P = (21.18 + 4.37e-4 * y0) * y0;
    double result = 1.8847719e-3  * P / x0;
    arrOut[pid] = result;

I compiled this kernel using nvcc 10.2 on sm_35 and its sass code is as follows

	     MOV R1, c[0x0][0x44];
	     S2R R0, SR_TID.X;
	     MOV32I R4, 0x7ae147ae;
	     S2R R3, SR_CTAID.X;
	     MOV32I R5, 0x40352e14;
	     IMAD R0, R3, c[0x0][0x28], R0;
	     MOV32I R3, 0x8;
	     ISCADD R6.CC, R0, c[0x0][0x140], 0x3;
	     MOV32I R8, 0x1;
	     IADD R2, R0, c[0x0][0x150];
	     SSY label3;
	     IMAD.U32.U32.HI.X R7, R0, R3, c[0x0][0x144];
	     ISCADD R12.CC, R2, c[0x0][0x140], 0x3;
	     LD.E.64 R6, [R6];
	     IMAD.U32.U32.HI.X R13, R2, R3, c[0x0][0x144];
	     LD.E.64 R2, [R12];
	     MUFU.RCP64H R9, R7;
	     DFMA R10, -R6, R8, c[0x2][0x10];
	     DFMA R4, R2, c[0x2][0x0], R4;
	     DFMA R10, R10, R10, R10;
	     DMUL R2, R2, R4;
	     DFMA R8, R8, R10, R8;
	     DMUL R2, R2, c[0x2][0x8];
	     DMUL R4, R2, R8;
	     DFMA R10, -R6, R4, R2;
	     DFMA R8, R8, R10, R4;
	     FSET.GTU.AND RZ.CC, |R3|, c[0x2][0x18], PT; 
	     FFMA R4, RZ, R7, R9;
	     FSETP.GT.AND P0, PT, |R4|, c[0x2][0x1c], PT;
	@P0  BRA CC.NEU, label2;

	     CAL label4;

	     ISCADD.S R2.CC, R0, c[0x0][0x148], 0x3;

	     MOV32I R3, 0x8;
	     IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x14c];
	     ST.E.64 [R2], R8;

	     LOP32I.AND R4, R7, 0x40000000;
	     LOP32I.AND R9, R3, 0x7f800000;
	     MOV32I R8, 0x5ff00000;
	     MOV32I R14, 0x1;
	     MOV R12, RZ;
	     PBK label10;
	     ISETP.GE.U32.AND P0, PT, R4, c[0x2][0x20], PT;
	     MOV R4, RZ;
	     SEL R5, R8, c[0x2][0x24], !P0;
	     ISETP.GE.U32.AND P0, PT, R9, c[0x2][0x28], PT;
	     DMUL R10, R6, R4;
	     MUFU.RCP64H R15, R11;
	     SEL R13, R8, c[0x2][0x24], !P0;
	     DFMA R16, -R10, R14, c[0x2][0x10];
	     DMUL R12, R2, R12;
	     DFMA R16, R16, R16, R16;
	     DFMA R16, R14, R16, R14;
	     DMUL R14, R12, R16;
	     DFMA R10, -R10, R14, R12;
	     DFMA R10, R16, R10, R14;
	     DSETP.GT.AND P0, PT, |R10|, RZ, PT;
	@!P0 BRA label7;

	     ISETP.GT.U32.AND P0, PT, R9, c[0x2][0x2c], PT;
	     DMUL R12, R4, R10;
	     SEL R9, R8, c[0x2][0x24], P0;
	     MOV R8, RZ;
	     DMUL R10, R10, R8;
	     DMUL R8, R8, R12;
	     DMUL R4, R4, R10;
	     DFMA R10, R6, R8, -R2;
	     DFMA R12, R6, R4, -R2;
	     DSETP.GT.AND P0, PT, |R10|, |R12|, PT;
	     SEL R5, R5, R9, P0;
	     SEL R8, R4, R8, P0;
	     FSETP.GTU.AND P0, PT, |R5|, 1.469367938527859385e-39, PT;
	     MOV R9, R5;
	@P0  BRK;

	     FSETP.GEU.AND P0, PT, |R3|, 1.5046327690525280102e-36, PT;
	     LOP32I.AND R4, R8, 0xfffffffe;
	     MOV32I R12, 0x58500000;
	     MOV R9, R5;
	     LOP32I.OR R8, R8, 0x1;
	     SEL R13, R12, c[0x2][0x14], !P0;
	     MOV R12, RZ;
	     DMUL R6, R6, R12;
	     DMUL R2, R2, R12;
	     DFMA R10, R4, R6, -R2;
	     DFMA R12, R8, R6, -R2;
	     DSETP.GT.AND P0, PT, |R10|, |R12|, PT;
	     SEL R13, R8, R4, P0;
	     IADD32I R8.CC, R13, 0x1;
	     LOP32I.AND R4, R13, 0x1;
	     ISETP.NE.U32.AND P0, PT, R4, 0x1, PT;
	     IADD.X R9, RZ, R5;
	     IADD32I R12.CC, R13, -0x1;
	     SEL R8, R13, R8, P0;
	     IADD32I.X R4, R5, -0x1;
	     SEL R9, R5, R9, P0;
	     SEL R12, R12, R13, P0;
	     SEL R13, R4, R5, P0;
	     DFMA R4, R6, R8, -R2;
	     DFMA R2, R6, R12, -R2;
	     DSETP.GT.AND P0, PT, |R4|, |R2|, PT;
	     SEL R8, R12, R8, P0;
	     SEL R9, R13, R9, P0;

	     DSETP.NEU.AND P0, PT, R10, RZ, PT;
	@!P0 BRA label9;

	     MUFU.RCP64H R9, R7;
	     MOV R8, RZ;
	     DSETP.GT.AND P1, PT, |R8|, RZ, PT;
	@!P1 DSETP.NEU.AND P0, PT, |R6|, +INF , PT;
	@!P1 SEL R4, R6, R8, P0;
	@!P1 SEL R7, R7, R9, P0;
	@!P1 MOV R8, R4;
	@!P1 MOV R9, R7;
	     DMUL R8, R2, R8;

	     DMUL R8, R2, R6;


	     BRA label11;


I can understand everything in label0 except the last four lines

FSET.GTU.AND RZ.CC, |R3|, c[0x2][0x18], PT; 
FFMA R4, RZ, R7, R9;
FSETP.GT.AND P0, PT, |R4|, c[0x2][0x1c], PT;
@P0  BRA CC.NEU, label2;

If this predicate is taken, the control flow falls from label0 to label2 and then to label3, which goes to EXIT.
However, when this predicate is not taken, the control flow will go to label4. Does anyone know what the code following label4 is doing?

Another question is about MUFU.RCP64H R9, R7. It seems that this is the starting point of approximating a division using Newton-Raphson’s method and this instruction is supposed to compute an approximation of 1/x, where x is saved in R7:R6. However, I do not understand how this is done by performing reciprocal on the upper half of the 64-bit operand. It is highly appreciated if anyone can shed some light on this.

Since MUFU.RCP64H is a low-precision approximation, it does not need the least significant 32 bits of the 64 bit double argument('s mantissa).

Keep in mind that there are other iterations besides NR that provide faster convergence.

Do you have any reference about other methods that NVCC uses to approximate division?

Does it mean that NVCC would take different methods to compute the division according to some condition?

I provided the original implementation of double-precision division for CUDA, but not the version that is there now. Given that I am (1) not intimately familiar with the code that is in place now, and (2) NVIDIA has chosen not to document the details of their implementation, I do not feel it is appropriate for me to comment further.

There is plenty of literature available on how to implement floating-point division with iterative methods (Google Scholar is your friend) and since the disassembled code is available to you, you should be able to successfully reverse engineer it.

What is your motivation for trying to figure out how it works?

I am trying to develop a performance model for CUDA kernels that implement the particle filter algorithms. When it comes to FP64 division, the two branches (one corresponds to the coda in label0 and the other corresponds to the code following label4) are so different that it might be beneficial to understand when one path is taken. I will continue to reverse engineer the code in the other path to see if I can get something.

There are different kind of performance models. Depending on what kind of model it is, you could simply measure the throughput of the DP division operation for range[s] of operands you consider relevant to your modelling.

For processor architectures like x86 where the division algorithm is hidden (microcode and / or state machine) that is what you would have to do anyhow for a high-level simulation, so I am not sure why a GPU performance model would handle this differently. On the other hand, for a detailed execution-driven approach (e.g. Simics) you would simply let the model execute the instructions inside the division subroutine.

Generally speaking, emulated floating-point operations use a fast path and a slow path, where the slow path handles all exceptional cases (e.g. zero, infinity, NaN) plus cases that could trigger overflow / underflow. As a corollary, the slow path is typically longer in terms of static and dynamic instruction count.

I am trying to compute a tight upper bound of the execution time of the kernel, which is to be used in a real-time system.

Do you means that if the operands (or the range of the operands) are available, we can know which path it takes to perform the division? Is it possible to check for the exceptional cases at software and make sure that the hardware will always take the fast path?

There is no hardware for division on the GPU. All divisions are canned instruction sequences, inlined and / or called. If you need to find the worst-case execution time, treat the operation as a black box, just like you would when performing the same exercise on an x86 CPU.

Likely slow-path candidates triggering worst case execution times across all processor architectures (CPUs and GPUs) would be divisions that involve subnormal (denormal) inputs and / or outputs.

I assume that you are aware that GPU-accelerated software is not suitable for hard real-time systems, but may be suitable for some soft real-time use cases.

Thank you @njuffa, I really learned a lot from your comment.

Just out of curiosity. When the upper 32-bit of the DP operand is used for low-precision reciprocal computation, does it mean that the computation is performed on the following form

bit 31: sign
bit 30-20: exponent
bit 19-0: mantissa

That is how I understand the linked documentation of the rcp.approx.ftz.f64 PTX instruction, yes.

As a side note, and if memory serves me correctly, this instruction shows up as a single precision instruction in the profiler, which makes some sense.

Thank you @tera. This instruction skipped my eyes when I was reading the PTX doc.

It shows up as a single-precision instruction because it is handled by the MUFU (multi-function unit) as MUFU.RCP64H. This is using the exact same internal approximation as MUFU.RCP, with some adjustments for the different layout of single-precision vs double-precision formats. Because fewer mantissa bits remain in the output of MUFU.RCP64H, the accuracy is slightly lower than for MUFU.RCP.