fp16 vs fp32

Hello All,

I did some micro-benchmarking of the addition operation, for both fp16 and fp32, on a GTX 1080 Ti, compiled with -Xptxas -O0 to turn off ptxas optimization.
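
For reference, the build and disassembly steps looked something like this (the file name is a placeholder):

nvcc -arch=sm_61 -Xptxas -O0 -o hadd_bench hadd_bench.cu
cuobjdump -sass hadd_bench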

kernel for half_add()

#include <cuda_fp16.h>   // needed for the half type and __hadd()

__global__ void kern_hadd (half *my_array, uint *start_t, uint *end_t)
{
    unsigned int start_time1;
    unsigned int start_time2;
    unsigned int start_time3;

    unsigned int end_time1;
    unsigned int end_time2;
    unsigned int end_time3;

    half a = my_array[0];
    half b = my_array[1];
    half c;

    __syncthreads();

    start_time1 = clock();   // empty timed section: baseline clock() overhead
    end_time1 = clock();

    __syncthreads();

    start_time2 = clock();   // timed section containing the half-precision add
    c = __hadd(a, b);
    end_time2 = clock();

    __syncthreads();

    start_time3 = clock();
    end_time3 = clock();

    start_t[0] = start_time1;
    start_t[1] = start_time2;
    start_t[2] = start_time3;

    end_t[0] = end_time1;
    end_t[1] = end_time2;
    end_t[2] = end_time3;

    my_array[2] = c;
}

SASS for half_add()

        /*0228*/                   MOV R8, R8;                       /* 0x5c98078000870008 */
        /*0230*/                   BAR.SYNC 0x0;                     /* 0xf0a81b8000070000 */
        /*0238*/                   CS2R R10, SR_CLOCKLO;             /* 0x50c800000507000a */
                                                                     /* 0x00643c03fde01fef */
        /*0248*/                   MOV R10, R10;                     /* 0x5c98078000a7000a */
        /*0250*/                   MOV R11, R10;                     /* 0x5c98078000a7000b */
        /*0258*/                   HADD2 R4, R4.H0_H0, R9.H0_H0;     /* 0x5d11000020970404 */
                                                                     /* 0x007fbc03fde01fef */
        /*0268*/                   CS2R R9, SR_CLOCKLO;              /* 0x50c8000005070009 */
        /*0270*/                   MOV R9, R9;                       /* 0x5c98078000970009 */
        /*0278*/                   MOV R10, R9;                      /* 0x5c9807800097000a */
                                                                     /* 0x007fbc03fde019ef */
        /*0288*/                   BAR.SYNC 0x0;                     /* 0xf0a81b8000070000 */
        /*0290*/                   CS2R R9, SR_CLOCKLO;              /* 0x50c8000005070009 */
        /*0298*/                   MOV R9, R9;                       /* 0x5c98078000970009 */

syn1: 45 (clk/warp) : 2 MOV
syn2: 131 (clk/warp) : 2 MOV + HADD2
syn3: 45 (clk/warp) : 2 MOV

__hadd() consumes around 131 - 45 = 86 clocks.
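
In case it helps, here is a minimal host-side harness for reading back the clock() samples (names and initialization values are illustrative; it assumes kern_hadd from above is in the same .cu file and a CUDA version whose cuda_fp16.h provides host-side __float2half):

#include <cstdio>
#include <cuda_fp16.h>

int main()
{
    half *d_array;
    unsigned int *d_start, *d_end;       // uint in the kernel is just unsigned int
    unsigned int h_start[3], h_end[3];

    cudaMalloc(&d_array, 3 * sizeof(half));
    cudaMalloc(&d_start, 3 * sizeof(unsigned int));
    cudaMalloc(&d_end,   3 * sizeof(unsigned int));

    // initialize the two operands on the host and copy them over
    half h_in[2] = { __float2half(1.0f), __float2half(2.0f) };
    cudaMemcpy(d_array, h_in, 2 * sizeof(half), cudaMemcpyHostToDevice);

    kern_hadd<<<1, 32>>>(d_array, d_start, d_end);   // a single warp
    cudaDeviceSynchronize();

    cudaMemcpy(h_start, d_start, sizeof(h_start), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_end,   d_end,   sizeof(h_end),   cudaMemcpyDeviceToHost);

    for (int i = 0; i < 3; i++)                      // per-segment clock deltas
        printf("syn%d: %u clocks\n", i + 1, h_end[i] - h_start[i]);

    cudaFree(d_array); cudaFree(d_start); cudaFree(d_end);
    return 0;
}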

kernel for float_add()

__global__ void kern_fadd (float *my_array, uint *start_t, uint *end_t)
{
    unsigned int start_time1;
    unsigned int start_time2;
    unsigned int start_time3;

    unsigned int end_time1;
    unsigned int end_time2;
    unsigned int end_time3;

    float a = my_array[0];
    float b = my_array[1];
    float c;

    __syncthreads();

    start_time1 = clock();
    end_time1 = clock();

    __syncthreads();

    start_time2 = clock();   // timed section containing the single-precision add
    c = a + b;
    end_time2 = clock();

    __syncthreads();

    start_time3 = clock();
    end_time3 = clock();

    start_t[0] = start_time1;
    start_t[1] = start_time2;
    start_t[2] = start_time3;

    end_t[0] = end_time1;
    end_t[1] = end_time2;
    end_t[2] = end_time3;

    my_array[2] = c;
}

SASS for float_add()

        /*0228*/                   MOV R8, R8;                       /* 0x5c98078000870008 */
        /*0230*/                   BAR.SYNC 0x0;                     /* 0xf0a81b8000070000 */
        /*0238*/                   CS2R R10, SR_CLOCKLO;             /* 0x50c800000507000a */
                                                                     /* 0x007fbc03fde01fef */
        /*0248*/                   MOV R10, R10;                     /* 0x5c98078000a7000a */
        /*0250*/                   MOV R11, R10;                     /* 0x5c98078000a7000b */
        /*0258*/                   FADD R7, R7, R9;                  /* 0x5c58000000970707 */
                                                                     /* 0x007fbc03fde01fef */
        /*0268*/                   MOV R7, R7;                       /* 0x5c98078000770007 */
        /*0270*/                   CS2R R9, SR_CLOCKLO;              /* 0x50c8000005070009 */
        /*0278*/                   MOV R9, R9;                       /* 0x5c98078000970009 */
                                                                     /* 0x007fbc033de01fef */
        /*0288*/                   MOV R10, R9;                      /* 0x5c9807800097000a */
        /*0290*/                   BAR.SYNC 0x0;                     /* 0xf0a81b8000070000 */
        /*0298*/                   CS2R R9, SR_CLOCKLO;              /* 0x50c8000005070009 */

syn1: 45 (clk/warp) : 2 MOV
syn2: 75 (clk/warp) : 2 MOV + FADD + MOV
syn3: 45 (clk/warp) : 2 MOV

The FADD plus its trailing MOV consume around 75 - 45 = 30 clocks, so the float addition itself consumes around 15 clocks.

It appears that fp16 is not as fast as fp32. Is that true? Can we say that the main benefit of using fp16 comes from the reduced memory bandwidth?

This is true for the architecture of your GPU, which is sm_61 (= compute capability 6.1). Only architectures sm_60, sm_70, and possibly sm_62 (not sure about the last one) are designed for high FP16 computational throughput.

For all other architectures, FP16 makes a lot of sense as a storage format (a lot of sensor data only requires FP16 due to the use of 10-bit ADCs, for example) while doing all computation in high-throughput FP32, optimizing use of memory bandwidth in this way. By the way, reading from an FP16 texture automatically converts the data to FP32, so that may be something to take into consideration.
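
A minimal sketch of that storage-only pattern (the kernel name and the scale operation are made up for illustration):

#include <cuda_fp16.h>

// FP16 in memory, FP32 in registers: widen on load, compute, narrow on store.
__global__ void scale_fp16_storage (const half *in, half *out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(in[i]);      // half -> float on load
        out[i] = __float2half(x * scale);   // FP32 multiply, float -> half on store
    }
}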

Hello njuffa,

Thanks for your reply.
I did some profiling on the P100, and here is what I found:

fp32 seems comparable to fp16, in terms of clock cycles.

Besides the bandwidth reduction, how can I use fp16 efficiently? Do I need to use half2 to vectorize the computation?

Thanks!

It’s all right there in the Pascal whitepaper:

https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

So yeah, you need to use half2 computation to get the doubled FLOPS rate compared to FP32.
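
For illustration, a minimal half2 sketch (the kernel name is made up); __hadd2 performs both fp16 additions in a single HADD2 instruction, which is where the doubled rate comes from:

#include <cuda_fp16.h>

// Element-wise add over arrays packed as half2.
// n2 is the number of half2 pairs, i.e. half the number of fp16 elements.
__global__ void vec_hadd2 (const half2 *a, const half2 *b, half2 *c, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        c[i] = __hadd2(a[i], b[i]);   // two fp16 adds per instruction
}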