Differences in single-precision floating-point addition between sm_13 and sm_20?

I’m seeing some minor differences in single-precision floating-point addition depending on whether I build for sm_13 or sm_20. I’m using nvcc 3.0 on Linux, and the resulting PTX code is identical in both cases. Here’s a simple test app:

// kernel.cu:

#include <cstdio>

union u32
{
    float f;
    int u;
};

__global__ void singlePrecisionTest(int* result)
{
    u32 a;
    a.u = -2020276922;

    u32 b;
    b.u = 127206940;

    u32 c;
    c.f = a.f + b.f;

    result[0] = c.u;
}

extern "C"
void singlePrecisionTest()
{
    int h_result;
    int* d_result;

    cudaMalloc((void**) &d_result, sizeof(int));

    singlePrecisionTest<<<1, 1>>>(d_result);

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

    printf("result = %d\n", h_result);

    cudaFree(d_result);
}

// main.cpp:

extern "C"
void singlePrecisionTest();

int main(int argc, char** argv)
{
    singlePrecisionTest();

    return 0;
}

Compiled like this:

nvcc -arch sm_20 -c -o kernel.cu.o kernel.cu
g++433 -c -o main.cpp.o main.cpp
g++433 -o app -Lcuda-3.0/lib64 -lcudart main.cpp.o kernel.cu.o
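For the sm_13 case I rebuild kernel.cu the same way, just with the other arch flag:

nvcc -arch sm_13 -c -o kernel.cu.o kernel.cu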

When I run this using sm_20 (as above), I get this:

result = 3506176

However, when running using sm_13, I get this:

result = 0

I’m aware there are minor differences in single-precision floating-point calculations for division, square root, etc., but I expected a simple addition like this to adhere to the IEEE 754 spec in the same way on both architectures.

Can anyone either explain why this is the case or point me to a relevant article or document explaining this?

Thanks,

Dan

Please note that type punning via a union invokes undefined behavior according to the C++ standard (which governs CUDA code). I would recommend reinterpreting floats as ints, and ints as floats, via CUDA’s __int_as_float() and __float_as_int() intrinsics. The resulting code in this case would look like:

int a = -2020276922;
int b = 127206940;
result[0] = __float_as_int(__int_as_float(a) + __int_as_float(b));

Alternatively, C++’s reinterpret_cast can be used; a sketch of that follows below.
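For illustration, here is a minimal host-side sketch of the reinterpret_cast variant (note that dereferencing the casted pointers also bends strict-aliasing rules; memcpy is the fully portable alternative):

#include <cstdio>

int main()
{
    int a = -2020276922;
    int b = 127206940;

    // View the int bit patterns as floats via reinterpret_cast of pointers.
    float fa = *reinterpret_cast<float*>(&a);
    float fb = *reinterpret_cast<float*>(&b);
    float sum = fa + fb;

    // View the float result's bit pattern as an int again.
    printf("result = %d\n", *reinterpret_cast<int*>(&sum));
    return 0;
}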

One significant difference between sm_1x and sm_2x GPUs is that the latter support single-precision denormals. On an sm_1x GPU, any single-precision result whose magnitude is less than 2^-126 (1.175494351e-38) is converted to a zero of like sign. This is called the “flush to zero” response, or FTZ for short. An sm_2x GPU gives the user a choice between flush-to-zero behavior and denormal support for single-precision arithmetic. The compiler enables denormal support by default when targeting sm_2x GPUs; the user can override this default with the --ftz flag.
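For example, the two behaviors can be selected explicitly when compiling the kernel from the original post for sm_20:

nvcc -arch sm_20 --ftz=false -c -o kernel.cu.o kernel.cu
nvcc -arch sm_20 --ftz=true  -c -o kernel.cu.o kernel.cu

The first form (the compiler default for sm_2x) keeps denormal support; the second flushes denormals to zero, matching the sm_1x behavior.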

Here, a reinterpreted as a float has the value -2.242212771e-34, and b reinterpreted as a float has the value 2.242261903e-34. The sum of these two floats is 4.913199044e-39, which corresponds to 3506176 when reinterpreted as an int. The result is a denormal, so we would expect the following results:

for sm_1x: 0 // due to flush-to-zero

for sm_2x with --ftz=true: 0 // due to flush-to-zero

for sm_2x with --ftz=false: 3506176 // denormal support enabled (compiler default)
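As a sanity check, the same arithmetic can be reproduced on the host with a small sketch that uses memcpy for the reinterpretation (x86 CPUs handle denormals by default, so this shows the non-FTZ result):

#include <cstdio>
#include <cstring>

// Reinterpret bit patterns without type punning through a union.
static float int_as_float(int i)   { float f; std::memcpy(&f, &i, sizeof f); return f; }
static int   float_as_int(float f) { int i;   std::memcpy(&i, &f, sizeof i); return i; }

int main()
{
    float a = int_as_float(-2020276922);   // -2.242212771e-34
    float b = int_as_float(127206940);     //  2.242261903e-34
    float c = a + b;                       //  4.913199044e-39, a denormal

    printf("sum = %.9e  bits = %d\n", c, float_as_int(c));   // expect bits = 3506176
    return 0;
}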

Thanks for the comprehensive answer; that was exactly the information I was after. I wasn’t aware that only the recent Fermi-generation GPUs support single-precision denormals.