Discrepancy between powf() return values on host and device

Hi,

The sample code below calls powf() on both the host and the device. I guess the results differ because the device powf() function has a maximum error of 8 ULP.

Is there a way to produce the same results on the device? (Fractional exponents are required.)

Thank you for your time.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

__global__ void powKernel(float *a, float *b, const int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i < size) {
		b[i] = powf(a[i], 1.67f);
	}
}

int main()
{
	const int size = 10000;
	float *h_a, *h_b, *h_c, *d_a, *d_b;
	cudaError_t cudaStatus;	

	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
		return 1;
	}

	h_a = (float *)malloc(size * sizeof(float));
	h_b = (float *)malloc(size * sizeof(float));
	h_c = (float *)malloc(size * sizeof(float));

	srand(time(NULL));

	for (int i = 0; i < size; i++) {
		h_a[i] = ((float)rand() / (float)RAND_MAX) * 100.0f;
		h_b[i] = powf(h_a[i], 1.67f);
	}

	cudaMalloc(&d_a, size * sizeof(float));
	cudaMalloc(&d_b, size * sizeof(float));

	cudaMemcpy(d_a, h_a, size * sizeof(float), cudaMemcpyHostToDevice);

	powKernel<<<(size / 1024) + 1, 1024>>>(d_a, d_b, size);

	cudaMemcpy(h_c, d_b, size * sizeof(float), cudaMemcpyDeviceToHost);

	int count = 0;
	for (int i = 0; i < size; i++) {
		if (h_b[i] != h_c[i]) {
			printf("%f vs %f\n", h_b[i], h_c[i]);
			count++;
		}
	}

	printf("Total: %d, wrong: %d\n", size, count);

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    free(h_a);
    free(h_b);
    free(h_c);
    cudaFree(d_a);
    cudaFree(d_b);

    return 0;
}

Except for specialized libraries that return correctly-rounded results for standard math functions, no two math libraries will return bit-identical results. That is independent of the use of GPUs. For example, the math libraries of gcc, MSVC, the Intel compiler, and the PGI compiler (to name just four) are all going to return different results for some arguments to some functions. That is the basic reality of floating-point computation.

Some host tool-chain math libraries implement single-precision math functions as wrappers around the double-precision variant, which results in almost correctly rounded implementations. You could do the same manually by invoking the double-precision version of pow(), but the resulting performance penalty could be quite significant.
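If you want to try that in the kernel, a minimal sketch (same structure as the kernel in the question, with the computation promoted to double) could look like this; the device result will then usually, though not necessarily always, round to the same float as a nearly correctly rounded host powf():

__global__ void powKernelDP(float *a, float *b, const int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < size) {
        // Compute in double precision, then round once back to float.
        // Expect a noticeable slowdown on GPUs with low FP64 throughput.
        b[i] = (float)pow((double)a[i], 1.67);
    }
}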

BTW, is your intended exponent really 1.67, or actually 5/3 (= 1.666666…)? If it is the latter, you may want to try computing

float t = a[i];
float r = cbrtf(t);   // r = t^(1/3)
b[i] = r*r*t;         // t^(2/3) * t = t^(5/3)

which likely provides higher performance and better accuracy than calls to powf().

As a general guide for any numerical algorithm, if your code depends on the last bits of precision of any floating-point value or computation, you need to redesign your algorithm to remove that sensitivity. (This is more of a mathematical redesign than a programming redesign.) Or, as an easier but inefficient quick fix, use double precision.

Thank you all for the detailed replies. I’m converting part of an old Fortran scientific program to utilize the GPU, so it’s important to produce results as similar as possible to those of the original code.

In the Fortran code, the exponent is 1.67. I’ll try your suggestion to see if the difference is acceptable.

Thanks!

Be aware that changing from an exponent of 1.67f to one of 1.66666666f is going to cause much bigger differences in the results than the differences caused by different library implementations of powf().
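To see the size of that effect, here is a quick host-side check (the values in the comments are approximate):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 100.0;
    double p_167 = pow(x, 1.67);       // exponent as written: ~2187.76
    double p_53  = pow(x, 5.0 / 3.0);  // exponent 5/3:        ~2154.43
    // The relative difference is on the order of 1e-2, versus roughly 1e-7
    // for ULP-level differences between float library implementations.
    printf("x^1.67 = %.6f, x^(5/3) = %.6f, rel. diff = %.2e\n",
           p_167, p_53, (p_167 - p_53) / p_53);
    return 0;
}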

By the same token, a call to cbrt() is generally more accurate than computing x**(1.0/3.0) in Fortran. See my answer to the Stack Overflow question “Why does Math.cbrt(1728) produce a more accurate result than Math.pow(1728, 1/3)?” for an explanation.

Generally speaking, for accurate floating-point computations, Fortran is typically an inferior language to C/C++, now that the latter incorporate pretty much all features of IEEE-754.

Thanks again for all the replies. I have a follow up question.

Regarding the comments:

“…the math libraries of gcc, MSVC, the Intel compiler, and the PGI compiler (to name just four) are all going to return different results for some arguments to some functions.”

and

“… Fortran is typically an inferior language to C/C++, now that the latter incorporate pretty much all features of IEEE-754.”

Is there any recommended source (online or otherwise) that compares and contrasts the floating-point differences between compilers and/or between Fortran and C/C++, or is it a matter of looking at which components of the IEEE-754 standard each compiler implements?

Many thanks,

Daeyoun

NVIDIA issued a whitepaper that addresses some of the issues people commonly encounter when moving code from x86 CPUs to GPUs:

[url]http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf[/url]

Intel provides a whitepaper, available on the Intel Developer Zone, discussing floating-point consistency issues for their software products.

A fairly generic discussion is provided by this paper:

David Monniaux, “The pitfalls of verifying floating-point computations” ([url]https://arxiv.org/pdf/cs/0701192.pdf[/url])

It is generally understood by users of floating-point arithmetic that results will differ across platforms, compilers, and compiler switches, but since the details change over time and are product specific, I don’t think there is a comprehensive overview available anywhere (at least I am not aware of one). Note that IEEE 754 mandates the behavior of specific basic operations only, not the entire functionality of math.h/cmath.

More enlightened vendors will at least give you a list of experimentally established error bounds for their math library implementations, but the data may be incomplete or inaccurate, and a generic bound doesn’t tell you where the worst case numerical errors are likely to occur.

The numerical issues I see with Fortran (as compared to C/C++) are basically:

(1) It lacks important computational primitives, such as fma() or cbrt(), that may be crucial to accurate computation.

(2) The language specification gives compilers a lot of freedom to re-associate floating-point computation, as long as it is mathematically equivalent (most of these equivalences do not hold for finite-precision floating-point arithmetic, however).

(3) It has no first-class support for the IEEE-754 floating-point environment, i.e. no equivalent of fenv.h/cfenv in C/C++.

It should be noted that some C/C++ tool chains also exhibit some of the issues listed under (2) and (3), either at default settings or due to bugs. Two examples: at default settings, the Intel C++ compiler applies Fortran-like optimizations to floating-point computations, and you need /fp:strict to rein it in; gcc will happily optimize across calls to fesetenv(), destroying the intended semantics of setting the rounding mode (I believe I used the right flags to tell it not to, yet still found it doing so).
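To make points (1) through (3) concrete, here is a small C sketch exercising fma(), re-association sensitivity, and rounding-mode control via fenv.h; depending on the tool chain you may need flags such as /fp:strict or -frounding-math (as discussed above) for the rounding-mode part to behave as intended:

#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    /* (2) Re-association does not hold in finite precision. */
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0 */

    /* (1) fma() evaluates x*x + (-1) with a single rounding; the result
       typically differs from x*x - 1.0 in the last digits. */
    double x = 1.0 + 1e-8;
    printf("x*x - 1.0     = %.17g\n", x * x - 1.0);
    printf("fma(x, x, -1) = %.17g\n", fma(x, x, -1.0));

    /* (3) fenv.h exposes the IEEE-754 rounding modes. The volatile
       qualifier keeps the compiler from folding the division at
       compile time under the default rounding mode. */
    volatile double one = 1.0, three = 3.0;
    fesetround(FE_UPWARD);
    printf("1/3 rounded up   = %.17g\n", one / three);
    fesetround(FE_DOWNWARD);
    printf("1/3 rounded down = %.17g\n", one / three);
    fesetround(FE_TONEAREST);
    return 0;
}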

Side remark: A more accurate implementation of powf() is likely possible without loss of performance. CUDA’s single-precision math functions were written before a single-precision FMA operation was provided by the hardware, and not all of them seem to have been updated to take full advantage of FMA.
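As an illustration of why FMA matters for accuracy (not of how CUDA’s powf() is actually implemented), here is a generic Horner-scheme polynomial evaluation where each fmaf() performs a multiply and an add with a single rounding:

// Hypothetical cubic-polynomial helper; each step rounds once instead of twice.
__device__ float poly3(float x, float c3, float c2, float c1, float c0)
{
    float r = c3;
    r = fmaf(r, x, c2);   // r = c3*x + c2, single rounding
    r = fmaf(r, x, c1);
    r = fmaf(r, x, c0);
    return r;
}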

The largest errors with powf() occur when the first argument is near 1.0 and the result is close to either the overflow or the underflow boundary. The error in the average case is much smaller than the bound stated in the CUDA documentation.
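If you want to get a feel for this on your own GPU, below is a rough sketch of a test harness (my own construction, not part of any CUDA sample) that samples bases just above 1.0, chooses exponents so the true results land near the float overflow threshold, and reports the largest observed deviation from a double-precision host reference in float ULPs. It only measures the points it happens to sample, so it does not establish an error bound.

#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void probeKernel(const float *x, const float *y, float *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) r[i] = powf(x[i], y[i]);
}

int main()
{
    const int n = 1 << 20;
    float *h_x = (float *)malloc(n * sizeof(float));
    float *h_y = (float *)malloc(n * sizeof(float));
    float *h_r = (float *)malloc(n * sizeof(float));

    // Sample bases just above 1.0 and pick exponents so the true result
    // lands near the float overflow threshold (about exp(88) here).
    for (int i = 0; i < n; i++) {
        h_x[i] = 1.0f + (float)(i + 1) * 1e-7f;
        h_y[i] = (float)(88.0 / log((double)h_x[i]));
    }

    float *d_x, *d_y, *d_r;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_r, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
    probeKernel<<<(n + 255) / 256, 256>>>(d_x, d_y, d_r, n);
    cudaMemcpy(h_r, d_r, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a double-precision host reference and express the
    // difference in units of the float ULP at the reference value.
    double max_ulp = 0.0;
    for (int i = 0; i < n; i++) {
        double ref = pow((double)h_x[i], (double)h_y[i]);
        double ulp = (double)(nextafterf((float)ref, INFINITY) - (float)ref);
        double err = fabs((double)h_r[i] - ref) / ulp;
        if (err > max_ulp) max_ulp = err;
    }
    printf("max observed error: %.2f ulp over %d samples\n", max_ulp, n);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_r);
    free(h_x); free(h_y); free(h_r);
    return 0;
}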