Why are the calculations different between CPU and GPU?

jonghwan.chae · February 7, 2020, 6:48am

Why are the calculations different between CPU and GPU?
Are there anyone who know the cause and how to make the results same?

Test Device : Titan-X , Quadro RTX 6000

DIFF(11) / 5.7718768120 -5.4894809723 -0.4568047523 : CPU(63.6576385498) GPU(63.6576347351) 
DIFF(17) / 7.0205116272 -3.9807739258 0.9889754057 : CPU(66.1122131348) GPU(66.1122207642) 
DIFF(21) / 6.2994036674 -4.5103206635 1.7197880745 : CPU(62.9831542969) GPU(62.9831504822) 
DIFF(25) / 5.3536248207 -5.5441226959 0.7952759862 : CPU(60.0310630798) GPU(60.0310592651) 
DIFF(26) / 5.4337601662 -5.4639873505 2.3032853603 : CPU(64.6860275269) GPU(64.6860351562) 
DIFF(28) / 7.2088255882 -3.6889219284 2.8629994392 : CPU(73.7720794678) GPU(73.7720718384) 
DIFF(31) / 7.6020116806 -2.9594936371 3.0124745369 : CPU(75.6241836548) GPU(75.6241912842) 
DIFF(35) / 9.5662813187 -0.9373407364 2.2401032448 : CPU(97.4104156494) GPU(97.4104080200) 
DIFF(37) / 10.5662345886 0.3695564270 2.8683333397 : CPU(120.0092239380) GPU(120.0092163086)

#include <stdio.h>

float data[][5] = {
    {4.6963300705, -6.4615783691, 0.0183835626, 63.8078498840}, 
    {4.7764654160, -6.3814430237, 1.5263929367, 65.8673095703}, 
    {5.3628854752, -5.7950229645, 2.2014012337, 67.1889953613}, 
    {6.5515308380, -4.6063776016, 2.0861070156, 68.4931182861}, 
    {4.6488780975, -6.5090303421, 2.8611006737, 72.1654434204}, 
    {4.8525705338, -6.4884567261, -1.3599994183, 67.4971160889}, 
    {4.9327058792, -6.4083213806, 0.1480098963, 65.4200744629}, 
    {5.5191259384, -5.8219013214, 0.8230181932, 65.0326461792}, 
    {6.7077713013, -4.6332559586, 0.7077239752, 66.9621276855}, 
    {4.8051185608, -6.5359086990, 1.4827176332, 68.0057144165}, 
    {5.6917414665, -5.5696163177, -1.9648141861, 67.2770385742},
    {5.7718768120, -5.4894809723, -0.4568047523, 63.6576347351},
    {6.3582968712, -4.9030609131, 0.2182035446, 64.5155639648}, 
    {7.5469422340, -3.7144155502, 0.1029093266, 70.7638092041}, 
    {5.6442894936, -5.6170682907, 0.8779029846, 64.1801757812}, 
    {6.3539562225, -4.6473293304, -1.1940422058, 63.3961639404}, 
    {6.4340915680, -4.5671939850, 0.3139671087, 62.3553695679}, 
    {7.0205116272, -3.9807739258, 0.9889754057, 66.1122207642}, 
    {8.2091569901, -2.7921285629, 0.8736811876, 75.9495620728}, 
    {6.3065042496, -4.6947813034, 1.6486748457, 64.5310974121}, 
    {6.2192683220, -4.5904560089, 0.2117787600, 59.7964363098}, 
    {6.2994036674, -4.5103206635, 1.7197880745, 62.9831504822}, 
    {6.8858237267, -3.9239006042, 2.3947963715, 68.5466156006}, 
    {8.0744686127, -2.7352552414, 2.2795021534, 77.8747940063}, 
    {6.1718163490, -4.6379079819, 3.0544958115, 68.9314575195}, 
    {5.3536248207, -5.5441226959, 0.7952759862, 60.0310592651}, 
    {5.4337601662, -5.4639873505, 2.3032853603, 64.6860351562}, 
    {6.0201802254, -4.8775672913, 2.9782936573, 68.9034652710}, 
    {7.2088255882, -3.6889219284, 2.8629994392, 73.7720718384}, 
    {5.3061728477, -5.5915746689, 3.6379930973, 72.6561737061}, 
    {7.0155916214, -3.5459136963, 2.3374662399, 67.2557754517}, 
    {7.6020116806, -2.9594936371, 3.0124745369, 75.6241912842}, 
    {8.7906570435, -1.7708482742, 2.8971803188, 88.8052062988}, 
    {6.8880043030, -3.6735010147, 3.6721739769, 74.4240722656}, 
    {8.3776359558, -2.1259860992, 2.3553977013, 80.2524948120}, 
    {9.5662813187, -0.9373407364, 2.2401032448, 97.4104080200}, 
    {7.6636285782, -2.8399934769, 3.0150971413, 75.8875808716}, 
    {10.5662345886, 0.3695564270, 2.8683333397, 120.0092163086}, 
    {8.6635818481, -1.5330963135, 3.6433269978, 90.6818695068}, 
    {7.5730867386, -3.2432613373, -0.0125946999, 67.8705444336}, 
    {8.1595067978, -2.6568412781, 0.6624135971, 74.0751495361}, 
    {9.3481521606, -1.4681959152, 0.5471193790, 89.8428878784},
    {7.4454994202, -3.3708486557, 1.3221130371, 68.5460662842}
}; 

__global__ void multiple_dev (float (*d)[5])
{
    for (int i = 0; i < 42; i++) {
        d[i][4] = d[i][0]*d[i][0] + d[i][1]*d[i][1] + d[i][2]*d[i][2];
    }
}

void multiple_host (float (*d)[5])
{
    for (int i = 0; i < 42; i++) {
        d[i][3] = d[i][0]*d[i][0] + d[i][1]*d[i][1] + d[i][2]*d[i][2];
    }
}


int main()
{   
    // distance square in CPU
    multiple_host(data);

    // distance square in GPU
    float (*dev_data) [5]; 
    cudaMalloc( (void**)&dev_data, sizeof(data) );
    cudaMemcpy( dev_data,    data, sizeof(data), cudaMemcpyHostToDevice);

    multiple_dev <<<1, 1>>> (dev_data);

    cudaMemcpy( data, dev_data, sizeof(data), cudaMemcpyDeviceToHost);

    // check the results from CPU and GPU
    for (int i = 0; i < 42; i++) {
        if (data[i][3] != data[i][4]) {
            printf ("DIFF(%02d) / %.10f %.10f %.10f : CPU(%.10f) GPU(%.10f) \n"
                , i, data[i][0], data[i][1], data[i][2], data[i][3], data[i][4]); 
        }
    }

    return 0;

}

mnicely · February 7, 2020, 6:50pm

The difference is due to FMA. If you disable the FMA on the GPU (-fmad=false), you will get matching results.

njuffa · February 7, 2020, 6:56pm

[I was still typing when mnicely added post #2; please excuse any redundancy]

Your computation is a sum of products, i.e. a dot product. This is a prime application for the use of FMA (fused multiply-add), the basic arithmetic building block of a GPU. Most modern CPUs also support the FMA operation, e.g. Intel CPUs since Haswell.

Its use improves performance and, on average, improves the accuracy of computation since it reduces overall rounding error and provides some protection against subtractive cancellation.

By default, the CUDA compiler tries to utilize FMA as much as possible, by looking for FADD dependent on FMUL, then contracting the pair into a single FMA. Many host compilers can perform the contraction where the hardware supports it, but don’t do so by default.

So you can either turn off FMA contraction for the CUDA code with the compiler switch -fmad=false, or turn on FMA generation for the host compiler, and this will likely give you bit-wise identical results in this particular case. My recommendation would be to use the higher-accuracy, higher-performance approach and turn on FMA-generation throughout.

Note that there are several reasons floating-point computations may not match between two platforms, independent of the use of GPUs. It is simply a reality of dealing with floating-point computations. So instead of expecting and checking for bit-wise identical results, it is better to assess overall accuracy by comparing to a higher-precision reference. NVIDIA provides a (slightly dated) whitepaper that explains the most important issues: https://docs.nvidia.com/cuda/floating-point/index.html

Topic		Replies	Views
Why accuracy CPU and GPU not equal? CUDA Programming and Performance	6	10902	October 28, 2014
CPU and GPU floating point calculations Results are different CUDA Programming and Performance	6	21900	August 7, 2010
discrepancy between CPU and GPU after a division (accuracy issue) CUDA Programming and Performance	3	1478	June 10, 2015
Confused about GPU vs CPU speed in multiplication CUDA Programming and Performance	8	6529	February 19, 2009
Unacceptably low accuracy of calculations on the GPU CUDA Programming and Performance cuda	12	1217	May 15, 2024
Cube computing difference in GPU and CPU? CUDA Programming and Performance	4	510	November 1, 2017
GPU Code and CPU Code output not matching till machine precision (i.e. 13 decimals places) CUDA Programming and Performance	22	717	August 9, 2023
Mythical Tflops CUDA Programming and Performance	11	1061	January 14, 2019
GPU/CPU precision comparison and Kernel instructions question CUDA Programming and Performance	5	670	April 4, 2017
precision CUDA Programming and Performance	3	2614	December 16, 2008

Why are the calculations different between CPU and GPU?

Related topics