How is FP64 division implemented?

I am trying to understand how division is implemented by reading the SASS code for the following simple kernel:

__global__ void SI(double* arrIn,
                   double* arrOut,
                   int N)
{
    unsigned int pid = threadIdx.x + blockIdx.x * blockDim.x;
    double x0 = arrIn[pid];
    double y0 = arrIn[pid + N];
    double P = (21.18 + 4.37e-4 * y0) * y0;
    double result = 1.8847719e-3  * P / x0;
    arrOut[pid] = result;
}

I compiled this kernel with nvcc 10.2 for sm_35, and its SASS code is as follows:

label0:
	     MOV R1, c[0x0][0x44];
	     S2R R0, SR_TID.X;
	     MOV32I R4, 0x7ae147ae;
	     S2R R3, SR_CTAID.X;
	     MOV32I R5, 0x40352e14;
	     IMAD R0, R3, c[0x0][0x28], R0;
	     MOV32I R3, 0x8;
	     ISCADD R6.CC, R0, c[0x0][0x140], 0x3;
	     MOV32I R8, 0x1;
	     IADD R2, R0, c[0x0][0x150];
	     SSY label3;
	     IMAD.U32.U32.HI.X R7, R0, R3, c[0x0][0x144];
	     ISCADD R12.CC, R2, c[0x0][0x140], 0x3;
	     LD.E.64 R6, [R6];
	     IMAD.U32.U32.HI.X R13, R2, R3, c[0x0][0x144];
	     LD.E.64 R2, [R12];
	     MUFU.RCP64H R9, R7;
	     DFMA R10, -R6, R8, c[0x2][0x10];
	     DFMA R4, R2, c[0x2][0x0], R4;
	     DFMA R10, R10, R10, R10;
	     DMUL R2, R2, R4;
	     DFMA R8, R8, R10, R8;
	     DMUL R2, R2, c[0x2][0x8];
	     DMUL R4, R2, R8;
	     DFMA R10, -R6, R4, R2;
	     DFMA R8, R8, R10, R4;
	     FSET.GTU.AND RZ.CC, |R3|, c[0x2][0x18], PT; 
	     FFMA R4, RZ, R7, R9;
	     FSETP.GT.AND P0, PT, |R4|, c[0x2][0x1c], PT;
	@P0  BRA CC.NEU, label2;

label1:
	     CAL label4;

label2:
	     ISCADD.S R2.CC, R0, c[0x0][0x148], 0x3;

label3:
	     MOV32I R3, 0x8;
	     IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x14c];
	     ST.E.64 [R2], R8;
	     EXIT;

label4:
	     LOP32I.AND R4, R7, 0x40000000;
	     LOP32I.AND R9, R3, 0x7f800000;
	     MOV32I R8, 0x5ff00000;
	     MOV32I R14, 0x1;
	     MOV R12, RZ;
	     PBK label10;
	     ISETP.GE.U32.AND P0, PT, R4, c[0x2][0x20], PT;
	     MOV R4, RZ;
	     SEL R5, R8, c[0x2][0x24], !P0;
	     ISETP.GE.U32.AND P0, PT, R9, c[0x2][0x28], PT;
	     DMUL R10, R6, R4;
	     MUFU.RCP64H R15, R11;
	     SEL R13, R8, c[0x2][0x24], !P0;
	     DFMA R16, -R10, R14, c[0x2][0x10];
	     DMUL R12, R2, R12;
	     DFMA R16, R16, R16, R16;
	     DFMA R16, R14, R16, R14;
	     DMUL R14, R12, R16;
	     DFMA R10, -R10, R14, R12;
	     DFMA R10, R16, R10, R14;
	     DSETP.GT.AND P0, PT, |R10|, RZ, PT;
	@!P0 BRA label7;

label5:
	     ISETP.GT.U32.AND P0, PT, R9, c[0x2][0x2c], PT;
	     DMUL R12, R4, R10;
	     SEL R9, R8, c[0x2][0x24], P0;
	     MOV R8, RZ;
	     DMUL R10, R10, R8;
	     DMUL R8, R8, R12;
	     DMUL R4, R4, R10;
	     DFMA R10, R6, R8, -R2;
	     DFMA R12, R6, R4, -R2;
	     DSETP.GT.AND P0, PT, |R10|, |R12|, PT;
	     SEL R5, R5, R9, P0;
	     SEL R8, R4, R8, P0;
	     FSETP.GTU.AND P0, PT, |R5|, 1.469367938527859385e-39, PT;
	     MOV R9, R5;
	@P0  BRK;

label6:
	     FSETP.GEU.AND P0, PT, |R3|, 1.5046327690525280102e-36, PT;
	     LOP32I.AND R4, R8, 0xfffffffe;
	     MOV32I R12, 0x58500000;
	     MOV R9, R5;
	     LOP32I.OR R8, R8, 0x1;
	     SEL R13, R12, c[0x2][0x14], !P0;
	     MOV R12, RZ;
	     DMUL R6, R6, R12;
	     DMUL R2, R2, R12;
	     DFMA R10, R4, R6, -R2;
	     DFMA R12, R8, R6, -R2;
	     DSETP.GT.AND P0, PT, |R10|, |R12|, PT;
	     SEL R13, R8, R4, P0;
	     IADD32I R8.CC, R13, 0x1;
	     LOP32I.AND R4, R13, 0x1;
	     ISETP.NE.U32.AND P0, PT, R4, 0x1, PT;
	     IADD.X R9, RZ, R5;
	     IADD32I R12.CC, R13, -0x1;
	     SEL R8, R13, R8, P0;
	     IADD32I.X R4, R5, -0x1;
	     SEL R9, R5, R9, P0;
	     SEL R12, R12, R13, P0;
	     SEL R13, R4, R5, P0;
	     DFMA R4, R6, R8, -R2;
	     DFMA R2, R6, R12, -R2;
	     DSETP.GT.AND P0, PT, |R4|, |R2|, PT;
	     SEL R8, R12, R8, P0;
	     SEL R9, R13, R9, P0;
	     BRK;

label7:
	     DSETP.NEU.AND P0, PT, R10, RZ, PT;
	@!P0 BRA label9;

label8:
	     MUFU.RCP64H R9, R7;
	     MOV R8, RZ;
	     DSETP.GT.AND P1, PT, |R8|, RZ, PT;
	@!P1 DSETP.NEU.AND P0, PT, |R6|, +INF , PT;
	@!P1 SEL R4, R6, R8, P0;
	@!P1 SEL R7, R7, R9, P0;
	@!P1 MOV R8, R4;
	@!P1 MOV R9, R7;
	     DMUL R8, R2, R8;
	     BRK;

label9:
	     DMUL R8, R2, R6;
	     BRK;

label10:
	     RET;

label11:
	     BRA label11;

label12:
	     NOP;

I can understand everything in label0 except the last four lines:

FSET.GTU.AND RZ.CC, |R3|, c[0x2][0x18], PT; 
FFMA R4, RZ, R7, R9;
FSETP.GT.AND P0, PT, |R4|, c[0x2][0x1c], PT;
@P0  BRA CC.NEU, label2;

If this branch is taken, control jumps from label0 to label2 and then falls through to label3, which ends in EXIT.
However, when the branch is not taken, control falls through to label1, which calls the subroutine at label4. Does anyone know what the code following label4 is doing?

Another question is about MUFU.RCP64H R9, R7. This seems to be the starting point of approximating the division with Newton-Raphson's method: the instruction is supposed to compute an approximation of 1/x, where x is stored in R7:R6. However, I do not understand how this can work when the reciprocal is taken of only the upper half of the 64-bit operand. It would be highly appreciated if anyone could shed some light on this.
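For reference, here is how I currently read the arithmetic portion of label0, written as a CUDA-style sketch. The __rcp64h() helper is made up to stand for MUFU.RCP64H; the rest maps onto the DFMA/DMUL sequence above, assuming c[0x2][0x10] holds the constant 1.0:

__device__ double fast_div_sketch(double a, double b)
{
    double r = __rcp64h(b);      // MUFU.RCP64H: crude 1/b from the high word of b
    double e = fma(-b, r, 1.0);  // DFMA: e = 1 - b*r
    e = fma(e, e, e);            // DFMA: e = e + e*e
    r = fma(r, e, r);            // DFMA: r = r*(1 + e + e^2), refined reciprocal
    double q = a * r;            // DMUL: initial quotient
    double res = fma(-b, q, a);  // DFMA: residual a - b*q
    return fma(r, res, q);       // DFMA: final quotient correction
}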

Since MUFU.RCP64H only computes a low-precision approximation, it does not need the least significant 32 bits of the 64-bit double argument's mantissa.

Keep in mind that there are other iteration schemes besides Newton-Raphson that provide faster convergence.
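For instance (my rough summary): with e = 1 - b*r, a plain Newton-Raphson step

r' = r * (1 + e)

squares the error (1 - b*r' = e^2), while the third-order variant

r' = r * (1 + e + e^2)

cubes it (1 - b*r' = e^3). The DFMA pair in label0 (e = e + e*e followed by r = r*e + r) looks like exactly one such third-order step.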

@njuffa
Do you have any references on the other methods that NVCC uses to approximate division?

Does this mean that NVCC uses different methods to compute the division depending on some condition?

I provided the original implementation of double-precision division for CUDA, but not the version that is there now. Given that I am (1) not intimately familiar with the code that is in place now, and (2) NVIDIA has chosen not to document the details of their implementation, I do not feel it is appropriate for me to comment further.

There is plenty of literature available on how to implement floating-point division with iterative methods (Google Scholar is your friend) and since the disassembled code is available to you, you should be able to successfully reverse engineer it.

What is your motivation for trying to figure out how it works?

I am trying to develop a performance model for CUDA kernels that implement particle filter algorithms. When it comes to FP64 division, the two branches (one corresponding to the code in label0, the other to the code following label4) are so different that it would be beneficial to understand when each path is taken. I will continue to reverse engineer the code in the other path to see if I can get somewhere with it.

There are different kinds of performance models. Depending on what kind of model it is, you could simply measure the throughput of the DP division operation for the range[s] of operands you consider relevant to your modelling.

For processor architectures like x86 where the division algorithm is hidden (microcode and / or state machine) that is what you would have to do anyhow for a high-level simulation, so I am not sure why a GPU performance model would handle this differently. On the other hand, for a detailed execution-driven approach (e.g. Simics) you would simply let the model execute the instructions inside the division subroutine.

Generally speaking, emulated floating-point operations use a fast path and a slow path, where the slow path handles all exceptional cases (e.g. zero, infinity, NaN) plus cases that could trigger overflow / underflow. As a corollary, the slow path is typically longer in terms of static and dynamic instruction count.
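As a rough illustration of that kind of screening (my sketch; the actual thresholds used by the SASS sequence are undocumented, so the bounds below are illustrative only):

#include <math.h>

// Conservatively flag operand pairs that could fall off the fast path of
// a software double-precision division: exceptional inputs, or exponents
// far enough apart that the quotient could overflow or underflow.
static bool may_need_slow_path(double a, double b)
{
    if (!isnormal(a) || !isnormal(b))
        return true;                       // zero, inf, NaN, subnormal
    int ediff = ilogb(a) - ilogb(b);       // rough exponent of the quotient
    return ediff > 1021 || ediff < -1021;  // illustrative bounds only
}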

I am trying to compute a tight upper bound of the execution time of the kernel, which is to be used in a real-time system.

Do you mean that if the operands (or the ranges of the operands) are known, we can tell which path the division takes? Is it possible to check for the exceptional cases in software and make sure that the hardware always takes the fast path?

There is no hardware for division on the GPU. All divisions are canned instruction sequences, inlined and / or called. If you need to find the worst-case execution time, treat the operation as a black box, just like you would when performing the same exercise on an x86 CPU.

Likely slow-path candidates triggering worst case execution times across all processor architectures (CPUs and GPUs) would be divisions that involve subnormal (denormal) inputs and / or outputs.
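A minimal timing sketch along those lines (my addition; kernel and helper names are made up): run it once with normal divisors and once with subnormal divisors such as 0x1p-1060 to compare fast-path and slow-path throughput.

#include <cuda_runtime.h>

__global__ void div_kernel(const double* a, const double* b, double* q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        q[i] = a[i] / b[i];  // the operation under test
}

static float time_div_ms(const double* dA, const double* dB, double* dQ, int n)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    div_kernel<<<(n + 255) / 256, 256>>>(dA, dB, dQ, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}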

I assume that you are aware that GPU-accelerated software is not suitable for hard real-time systems, but may be suitable for some soft real-time use cases.

Thank you @njuffa, I really learned a lot from your comment.

@tera
Just out of curiosity: when the upper 32 bits of the DP operand are used for the low-precision reciprocal computation, does that mean the computation is performed on the following layout?

bit 31: sign
bits 30-20: exponent
bits 19-0: mantissa (most significant 20 bits)
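For concreteness, a small sketch of how I picture extracting that high word on the host (my illustration):

#include <stdint.h>
#include <string.h>

// Returns the upper 32 bits of a double: sign in bit 31, the 11-bit
// exponent in bits 30-20, and the top 20 mantissa bits in bits 19-0.
static uint32_t high_word(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits >> 32);
}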

That is how I understand the linked documentation of the rcp.approx.ftz.f64 PTX instruction, yes.

As a side note, and if memory serves me correctly, this instruction shows up as a single precision instruction in the profiler, which makes some sense.

Thank you @tera. This instruction escaped my notice when I was reading the PTX documentation.

It shows up as a single-precision instruction because it is handled by the MUFU (multi-function unit) as MUFU.RCP64H. It uses the exact same internal approximation as MUFU.RCP, with some adjustments for the different layouts of the single-precision and double-precision formats. Because fewer mantissa bits remain in the output of MUFU.RCP64H, its accuracy is slightly lower than that of MUFU.RCP.
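If anyone wants to experiment with it from CUDA, the PTX instruction tera mentioned can be emitted directly via inline assembly. A minimal sketch (the wrapper name is mine):

__device__ double rcp_approx_f64(double x)
{
    double r;
    // rcp.approx.ftz.f64 is the PTX instruction that maps to MUFU.RCP64H
    asm("rcp.approx.ftz.f64 %0, %1;" : "=d"(r) : "d"(x));
    return r;
}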