Discrepancy between powf() return values on host and device

Hi,

The sample code below calls powf() on both the host and the device. I guess the results differ because the device powf() function has a maximum error of 8 ULP.

Is there a way to produce the same results on the device? (Fractional exponents are required.)

Thank you for your time.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

__global__ void powKernel(float *a, float *b, const int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i < size) {
		b[i] = powf(a[i], 1.67f);
	}
}

int main()
{
	const int size = 10000;
	float *h_a, *h_b, *h_c, *d_a, *d_b;
	cudaError_t cudaStatus;	

	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
		return 1;
	}

	h_a = (float *)malloc(size * sizeof(float));
	h_b = (float *)malloc(size * sizeof(float));
	h_c = (float *)malloc(size * sizeof(float));

	srand(time(NULL));

	for (int i = 0; i < size; i++) {
		h_a[i] = ((float)rand() / (float)RAND_MAX) * 100.0f;
		h_b[i] = powf(h_a[i], 1.67f);
	}

	cudaMalloc(&d_a, size * sizeof(float));
	cudaMalloc(&d_b, size * sizeof(float));

	cudaMemcpy(d_a, h_a, size * sizeof(float), cudaMemcpyHostToDevice);

	powKernel<<<(size / 1024) + 1, 1024>>>(d_a, d_b, size);

	cudaMemcpy(h_c, d_b, size * sizeof(float), cudaMemcpyDeviceToHost);

	int count = 0;
	for (int i = 0; i < size; i++) {
		if (h_b[i] != h_c[i]) {
			printf("%f vs %f\n", h_b[i], h_c[i]);
			count++;
		}
	}

	printf("Total: %d, wrong: %d\n", size, count);

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    free(h_a);
    free(h_b);
    free(h_c);
    cudaFree(d_a);
    cudaFree(d_b);

    return 0;
}

Except for specialized libraries that return correctly-rounded results for standard math functions, no two math libraries will return bit-identical results. That is independent of the use of GPUs. For example, the math libraries of gcc, MSVC, the Intel compiler, and the PGI compiler (to name just four) are all going to return different results for some arguments to some functions. That is the basic reality of floating-point computation.

Some host tool-chain math libraries implement single-precision math functions as wrappers around the double-precision variant, which results in almost correctly rounded implementations. You could do the same manually by invoking the double-precision version of pow(), but the resulting performance penalty could be quite significant.
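If you want to try that in the kernel, a minimal sketch (same structure as the kernel in the question, with the computation promoted to double) could look like this; the device result will then usually, though not necessarily always, round to the same float as a nearly correctly rounded host powf():

__global__ void powKernelDP(float *a, float *b, const int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < size) {
        // Compute in double precision, then round once back to float.
        // Expect a noticeable slowdown on GPUs with low FP64 throughput.
        b[i] = (float)pow((double)a[i], 1.67);
    }
}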

BTW, is your intended exponent really 1.67, or actually 5/3 (= 1.666666…)? If it is the latter, you may want to try computing

float t = a[i];
float r = cbrtf(t);   // r = t^(1/3)
b[i] = r*r*t;         // t^(2/3) * t = t^(5/3)

which likely provides higher performance and better accuracy than calls to powf().

As a general guide for any numerical algorithm, if your code depends on the last bits of precision of any floating-point value or computation, you need to redesign your algorithm to remove that sensitivity. (This is more of a mathematical redesign than a programming redesign.) Or, as an easier but inefficient quick fix, use double precision.

Thank you all for the detailed replies. I’m converting part of an old Fortran scientific program to utilize the GPU, so it’s important to produce results as similar as possible to those of the original code.

In the Fortran code, the exponent is 1.67. I’ll try your suggestion to see if the difference is acceptable.

Thanks!

Be aware that changing from an exponent of 1.67f to one of 1.66666666f is going to cause much bigger differences in the results than the differences caused by different library implementations of powf().
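To see the size of that effect, here is a quick host-side check (the values in the comments are approximate):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 100.0;
    double p_167 = pow(x, 1.67);       // exponent as written: ~2187.76
    double p_53  = pow(x, 5.0 / 3.0);  // exponent 5/3:        ~2154.43
    // The relative difference is on the order of 1e-2, versus roughly 1e-7
    // for ULP-level differences between float library implementations.
    printf("x^1.67 = %.6f, x^(5/3) = %.6f, rel. diff = %.2e\n",
           p_167, p_53, (p_167 - p_53) / p_53);
    return 0;
}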

By the same token, a call to cbrt() is generally more accurate than computing x**(1.0/3.0) in Fortran. See my answer to the Stack Overflow question “Why does Math.cbrt(1728) produce a more accurate result than Math.pow(1728, 1/3)?” for an explanation.

Generally speaking, for accurate floating-point computations, Fortran is typically an inferior language to C/C++, now that the latter incorporate pretty much all features of IEEE-754.

Thanks again for all the replies. I have a follow up question.

Regarding the comments:

“…the math libraries of gcc, MSVC, the Intel compiler, and the PGI compiler (to name just four) are all going to return different results for some arguments to some functions.”

and

“… Fortran is typically an inferior language to C/C++, now that the latter incorporate pretty much all features of IEEE-754.”

Is there any recommended source (online or otherwise) that compares and contrasts the floating-point differences between compilers and/or between Fortran and C/C++, or is it a matter of looking at which components of the IEEE-754 standard each compiler implements?

Many thanks,

Daeyoun

NVIDIA issued a whitepaper that addresses some of the issues people commonly encounter when moving code from x86 CPUs to GPUs:

[url]http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf[/url]

Intel provides a whitepaper, available on the Intel Developer Zone, discussing floating-point consistency issues for their software products.

A fairly generic discussion is provided by this paper:

David Monniaux, “The pitfalls of verifying floating-point computations” ([url]https://arxiv.org/pdf/cs/0701192.pdf[/url])

It is generally understood by users of floating-point arithmetic that results will differ across platforms, compilers, and compiler switches, but since the details change over time and are product specific, I don’t think there is a comprehensive overview available anywhere (at least I am not aware of one). Note that IEEE 754 mandates the behavior of specific basic operations only, not the entire functionality of math.h/cmath.

More enlightened vendors will at least give you a list of experimentally established error bounds for their math library implementations, but the data may be incomplete or inaccurate, and a generic bound doesn’t tell you where the worst case numerical errors are likely to occur.

The numerical issues I see with Fortran (as compared to C/C++) are basically:

(1) It lacks important computational primitives, such as fma() or cbrt(), that may be crucial to accurate computation.

(2) The language specification gives compilers a lot of freedom to re-associate floating-point computation, as long as it is mathematically equivalent (most of these equivalences do not hold for finite-precision floating-point arithmetic, however).

(3) It has no first-class support for the IEEE-754 floating-point environment, i.e. no equivalent of fenv.h/cfenv in C/C++.

It should be noted that some C/C++ tool chains also exhibit some of the issues listed under (2) and (3), either at default settings or due to bugs. Two examples: at default settings, the Intel C++ compiler applies Fortran-like optimizations to floating-point computations, and you need /fp:strict to rein it in; gcc will happily optimize across calls to fesetenv(), destroying the intended semantics of setting the rounding mode (I believe I used the right flags to tell it not to, yet still found it doing so).
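To make points (1) through (3) concrete, here is a small C sketch exercising fma(), re-association sensitivity, and rounding-mode control via fenv.h; depending on the tool chain you may need flags such as /fp:strict or -frounding-math (as discussed above) for the rounding-mode part to behave as intended:

#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    /* (2) Re-association does not hold in finite precision. */
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0 */

    /* (1) fma() evaluates x*x + (-1) with a single rounding; the result
       typically differs from x*x - 1.0 in the last digits. */
    double x = 1.0 + 1e-8;
    printf("x*x - 1.0     = %.17g\n", x * x - 1.0);
    printf("fma(x, x, -1) = %.17g\n", fma(x, x, -1.0));

    /* (3) fenv.h exposes the IEEE-754 rounding modes. The volatile
       qualifier keeps the compiler from folding the division at
       compile time under the default rounding mode. */
    volatile double one = 1.0, three = 3.0;
    fesetround(FE_UPWARD);
    printf("1/3 rounded up   = %.17g\n", one / three);
    fesetround(FE_DOWNWARD);
    printf("1/3 rounded down = %.17g\n", one / three);
    fesetround(FE_TONEAREST);
    return 0;
}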

Side remark: A more accurate implementation of powf() is likely possible without loss of performance. CUDA’s single-precision math functions were written before a single-precision FMA operation was provided by the hardware, and not all of them seem to have been updated to take full advantage of FMA.
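As an illustration of why FMA matters for accuracy (not of how CUDA’s powf() is actually implemented), here is a generic Horner-scheme polynomial evaluation where each fmaf() performs a multiply and an add with a single rounding:

// Hypothetical cubic-polynomial helper; each step rounds once instead of twice.
__device__ float poly3(float x, float c3, float c2, float c1, float c0)
{
    float r = c3;
    r = fmaf(r, x, c2);   // r = c3*x + c2, single rounding
    r = fmaf(r, x, c1);
    r = fmaf(r, x, c0);
    return r;
}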

The largest errors with powf() occur when the first argument is near 1.0 and the result is close to either the overflow or the underflow boundary. The error in the average case is much smaller than the bound stated in the CUDA documentation.
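If you want to get a feel for this on your own GPU, below is a rough sketch of a test harness (my own construction, not part of any CUDA sample) that samples bases just above 1.0, chooses exponents so the true results land near the float overflow threshold, and reports the largest observed deviation from a double-precision host reference in float ULPs. It only measures the points it happens to sample, so it does not establish an error bound.

#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void probeKernel(const float *x, const float *y, float *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) r[i] = powf(x[i], y[i]);
}

int main()
{
    const int n = 1 << 20;
    float *h_x = (float *)malloc(n * sizeof(float));
    float *h_y = (float *)malloc(n * sizeof(float));
    float *h_r = (float *)malloc(n * sizeof(float));

    // Sample bases just above 1.0 and pick exponents so the true result
    // lands near the float overflow threshold (about exp(88) here).
    for (int i = 0; i < n; i++) {
        h_x[i] = 1.0f + (float)(i + 1) * 1e-7f;
        h_y[i] = (float)(88.0 / log((double)h_x[i]));
    }

    float *d_x, *d_y, *d_r;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_r, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
    probeKernel<<<(n + 255) / 256, 256>>>(d_x, d_y, d_r, n);
    cudaMemcpy(h_r, d_r, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a double-precision host reference and express the
    // difference in units of the float ULP at the reference value.
    double max_ulp = 0.0;
    for (int i = 0; i < n; i++) {
        double ref = pow((double)h_x[i], (double)h_y[i]);
        double ulp = (double)(nextafterf((float)ref, INFINITY) - (float)ref);
        double err = fabs((double)h_r[i] - ref) / ulp;
        if (err > max_ulp) max_ulp = err;
    }
    printf("max observed error: %.2f ulp over %d samples\n", max_ulp, n);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_r);
    free(h_x); free(h_y); free(h_r);
    return 0;
}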